RDS SQL Server Data Import from Amazon S3
By: Sadequl Hussain | Updated: 2020-11-04 | Comments (4) | Related: > Amazon AWS
Problem
We have a large number of data files stored in Amazon S3 that we would like to import into an RDS SQL Server database. We know we can use ETL tools for this, but is there any way to import this data without ETL?
Solution
Amazon Web Services (AWS) recently announced a new feature of its Relational Database Service (RDS) for SQL Server. This feature allows a native integration between Amazon RDS SQL Server and Amazon S3. With this integration, it's now possible to import files from an Amazon S3 bucket into a local folder of the RDS instance. Similarly, files from that folder can be exported to S3. The RDS local folder path is D:\S3\.
This integration is a neat way to import data into RDS SQL Server databases without using an ETL tool. S3 data files downloaded to the local drive can be easily ingested using the BULK INSERT command.
Test Scenario
In this article, we will see how to set up an RDS SQL Server instance for native S3 integration. Although the feature is available for both single-AZ and multi-AZ scenarios, we will use a single-AZ RDS instance. We will copy some sample data files from an S3 bucket to the RDS instance and then ingest that data into a database.
Source Data
The data files for the examples in this article are from the City of Los Angeles Open Data Portal. There are many publicly available datasets on this site, some of which are from the Los Angeles Police Department (LAPD). One such dataset is the yearly LAPD Calls for Service. These datasets contain anonymized details of service requests made to the LAPD each year. The files are in CSV format and have header rows.
The image below shows a sample of the data:
As you can see, there are seven columns in each dataset, which are more or less self-explanatory:
- Incident Number - Unique number that identifies the call
- Reporting District - Numerical representation of the place where the call originated from
- Area Occurred - Place where the incident is reported to have occurred
- Dispatch Date - Date the call was dispatched to an officer of the LAPD
- Dispatch Time - Time the call was dispatched to an officer of the LAPD
- Call Type Code - Alphanumeric code for the service request type
- Call Type Description - Short description of the service request type
S3 Source Location
We have copied the files under a folder called "data" in an S3 bucket:
Notice how the files follow a naming pattern. Each file has the prefix "LAPD_Calls_for_Service_", followed by the year, and every file has a .csv extension. Here, we have files for the years 2010 through 2019. We will use this naming pattern in a script later.
RDS SQL Server Instance
We created a single-AZ RDS SQL Server 2017 instance in the same region as the S3 bucket (us-east-1, N. Virginia):
Destination Database and Table
We created a database called "s3_integration" and a table called "lapd_service_calls" in it. The table's fields correspond to the source file columns.
-- Create destination database
CREATE DATABASE [s3_integration]
 CONTAINMENT = NONE
 ON PRIMARY
( NAME = N's3_integration', FILENAME = N'D:\rdsdbdata\DATA\s3_integration.mdf' , SIZE = 102400KB , FILEGROWTH = 131072KB )
 LOG ON
( NAME = N's3_integration_log', FILENAME = N'D:\rdsdbdata\DATA\s3_integration_log.ldf' , SIZE = 25600KB , FILEGROWTH = 32768KB )
GO

USE s3_integration
GO

-- Create destination table
CREATE TABLE lapd_service_calls
(
   incident_number varchar(20) NOT NULL,
   reporting_district varchar(20) NOT NULL,
   area_occurred varchar(20) NOT NULL,
   dispatch_date varchar(30) NOT NULL,
   dispatch_time varchar(20) NOT NULL,
   call_type_code varchar(20) NOT NULL,
   call_type_description varchar(30) NULL
)
GO
Although some of the fields in the source data are numeric (Incident Number) or date (Dispatch Date) type, we are specifying each field as varchar to keep the loading process simple.
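If typed values are needed later, the varchar columns can still be converted at query time. A small sketch follows; it assumes the MM/DD/YYYY layout implied by how the month and year are extracted later in this article, and TRY_CONVERT requires SQL Server 2012 or later:

```sql
-- Convert the text dispatch_date (MM/DD/YYYY in these files) on the fly;
-- TRY_CONVERT returns NULL instead of failing on malformed rows.
SELECT TRY_CONVERT(date, dispatch_date, 101) AS dispatch_day,
       COUNT(*) AS calls
FROM s3_integration.dbo.lapd_service_calls
GROUP BY TRY_CONVERT(date, dispatch_date, 101)
ORDER BY dispatch_day;
```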
Configuring S3 Integration
We will now configure the RDS instance for S3 integration.
Step 1: Creating IAM Policy and Role
The first step for RDS SQL Server integration with S3 is to create an IAM role with access to the source S3 bucket and folders. In the images below, we are creating an IAM policy with the necessary access privileges:
We are calling this policy "sqlserver-s3-integration". Next, nosotros are assigning S3 permissions to the policy:
Finally, nosotros are specifying the S3 resource on which these permissions will utilize. In this case, we are providing the ARN (Amazon Resources Name) of the source S3 bucket. Note how we are selecting the selection to include all objects (including folders) under this bucket:
Once the policy is created, we have to assign information technology to an AIM role. In the images below, we are creating an IAM function with the aforementioned name as the policy, and assigning the policy to it:
Step 2: Add IAM Role to RDS SQL Server Instance
The next step is to attach the IAM role to the RDS instance. This can be done from the "Connectivity and security" tab of the RDS instance's properties. Under the "Manage IAM roles" section, we can select the role just created and then select "S3_INTEGRATION" from the "features" drop-down list:
Once the change takes effect, the role has "Active" status:
Importing Data from S3
Importing the data from S3 into a database table involves two steps:
- Download the S3 files to a local folder (D:\S3\) in RDS
- Use the BULK INSERT command to import the local data files into the table
Step 1: Download S3 Files
Amazon RDS for SQL Server comes with several custom stored procedures and functions. These are located in the msdb database. The stored procedure to download files from S3 is "rds_download_from_s3". The syntax for this stored procedure is shown here:
exec msdb.dbo.rds_download_from_s3
   @s3_arn_of_file='arn:aws:s3:::<bucket_name>/<file_name>',
   @rds_file_path='D:\S3\<custom_folder_name>\<file_name>',
   @overwrite_file=1;
The first parameter specifies the ARN of an S3 file.
The second parameter sets the destination path in the RDS instance. The root path is D:\S3\, but we can specify a custom folder name under it. The folder will be created if it doesn't already exist.
The third parameter is a flag. It indicates whether an already downloaded file should be overwritten when the procedure runs again.
Please note that you cannot specify wildcards to copy multiple files. The procedure expects individual file names.
Instead of running this command once per file, we are using a small script to automate the process:
declare @year_start smallint
declare @year_end smallint
declare @filename_prefix varchar(50)
declare @filename varchar(50)
declare @filename_suffix varchar(10)
declare @s3path_root varchar(200)
declare @s3path varchar(200)
declare @local_path_root varchar(20)
declare @local_path varchar(300)
declare @sql nvarchar(1000)

set @year_start = 2010
set @year_end = 2019
set @filename_prefix = 'LAPD_Calls_for_Service_'
set @filename_suffix = '.csv'
set @s3path_root = 'arn:aws:s3:::sqlserver-s3-integration/data/'
set @local_path_root = 'D:\S3\source_data\'

-- Loop through the file names and dynamically build a query to import the files
while (@year_start <= @year_end)
begin
   set @filename = @filename_prefix + convert(varchar(4), @year_start) + @filename_suffix
   set @s3path = @s3path_root + @filename
   set @sql = 'exec msdb.dbo.rds_download_from_s3 '
   set @sql = @sql + '@s3_arn_of_file=' + '''' + @s3path + ''', '
   set @sql = @sql + '@rds_file_path=' + '''' + @local_path_root + @filename + ''', '
   set @sql = @sql + '@overwrite_file=1;'
   exec(@sql)
   set @year_start = @year_start + 1
end
This script loops through the years 2010 to 2019 and dynamically builds each source filename inside the loop. As we saw before, the source files have a common naming pattern, which makes this process easy. A dynamic SQL command is then generated with the file name to call the rds_download_from_s3 stored procedure. Note how we are using a custom folder ("source_data") to save our files.
With each iteration, a download task is added to a queue for the RDS instance. RDS for SQL Server processes the tasks in this queue sequentially. At any time, at most two tasks can be running simultaneously.
Once the script finishes running, we can run another stored procedure to list the files copied to the local folder. This stored procedure is called "rds_gather_file_details". Running the stored procedure also adds a task at the end of the queue:
exec msdb.dbo.rds_gather_file_details;
We can check the status of all the queued tasks by running the "rds_fn_task_status" function:
SELECT * FROM msdb.dbo.rds_fn_task_status(NULL, 0);
The first parameter value is always NULL. The second parameter is the task ID. In this case, we want to know the status of all tasks, so we specify 0.
The image below shows the various task statuses. Note how some download tasks have succeeded, one is in progress, and some are yet to run. Also, the last task ("LIST_FILES_ON_DISK") was added when we ran the "rds_gather_file_details" stored procedure. This task has an ID of 15.
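To check a single task rather than all of them, we can pass that task's ID as the second parameter. For example, for a task with ID 15 as in this article:

```sql
-- Status of just one specific task (ID 15 in this example)
SELECT * FROM msdb.dbo.rds_fn_task_status(NULL, 15);
```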
When all files are successfully downloaded and the "LIST_FILES_ON_DISK" task has also completed, we can run another function to see the list of locally downloaded files:
SELECT * FROM msdb.dbo.rds_fn_list_file_details(15);
The "rds_fn_list_file_details" function's parameter value is the task ID of the "LIST_FILES_ON_DISK" task. In this example, it was 15. The image below shows the result:
Step 2: BULK INSERT the Data
We can run a series of BULK INSERT commands to load the data into the table we created. The code snippet below shows the load process:
declare @year_start smallint
declare @year_end smallint
declare @local_filename_prefix varchar(50)
declare @local_filename_suffix varchar(10)
declare @local_filename varchar(100)
declare @tablename sysname
declare @sql nvarchar(1000)

set @year_start = 2010
set @year_end = 2019
set @local_filename_prefix = 'D:\S3\source_data\LAPD_Calls_for_Service_'
set @local_filename_suffix = '.csv'
set @tablename = 's3_integration.dbo.lapd_service_calls'

while (@year_start <= @year_end)
begin
   set @local_filename = @local_filename_prefix + convert(varchar(4), @year_start) + @local_filename_suffix
   set @sql = 'BULK INSERT ' + @tablename + ' FROM ''' + @local_filename + ''' WITH (DATAFILETYPE = ''char'', FIRSTROW = 2, FIELDTERMINATOR = '','', ROWTERMINATOR = ''0x0a'')'
   exec(@sql)
   set @year_start = @year_start + 1
end
Once again, we are using the year as a variable and looping through the values between 2010 and 2019 to build a dynamic SQL command. Once all the files are loaded, the output looks like this:
We can now run a query to see the total number of rows loaded:
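For instance, a simple count over the destination table:

```sql
-- Total rows loaded across all ten yearly files
SELECT COUNT(*) AS total_rows
FROM s3_integration.dbo.lapd_service_calls;
```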
We can also run some analysis on the loaded data. In the code snippet below, we are finding, for each month, the total number of calls made to the LAPD about possible occurrences of domestic violence in the Hollywood neighborhood.
SELECT
   RIGHT(dispatch_date, 4) AS Incident_Year,
   CASE LEFT(dispatch_date, 2)
      WHEN '01' THEN 'January'
      WHEN '02' THEN 'February'
      WHEN '03' THEN 'March'
      WHEN '04' THEN 'April'
      WHEN '05' THEN 'May'
      WHEN '06' THEN 'June'
      WHEN '07' THEN 'July'
      WHEN '08' THEN 'August'
      WHEN '09' THEN 'September'
      WHEN '10' THEN 'October'
      WHEN '11' THEN 'November'
      WHEN '12' THEN 'December'
      ELSE 'Unknown'
   END AS Incident_Month,
   COUNT(*) AS Domestic_Violence_Calls
FROM s3_integration.dbo.lapd_service_calls
WHERE area_occurred LIKE '%Hollywood%'
AND call_type_description LIKE '%DOM%'
GROUP BY RIGHT(dispatch_date, 4), LEFT(dispatch_date, 2)
ORDER BY RIGHT(dispatch_date, 4), LEFT(dispatch_date, 2);
The results may show some interesting trends:
A Few Points to Remember
There are some limitations to RDS' integration with S3.
First of all, both the RDS instance and the S3 bucket have to be in the same AWS region. Cross-region integration isn't supported at the time of writing (2020). Similarly, the IAM role owner and the S3 bucket owner have to be the same IAM user; in other words, the S3 bucket cannot be in another AWS account.
Secondly, only files without any extension or files with the following extensions are supported:
- *.bcp
- *.csv
- *.dat
- *.fmt
- *.info
- *.lst
- *.tbl
- *.txt
- *.xml
Third, the maximum number of files allowed for local download is 100. But this can be overcome by removing the files once they are imported into a database.
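A sketch of such a cleanup, using the folder and file names from this article (the rds_delete_from_filesystem procedure is covered under "Other Functionality" further down):

```sql
-- Delete an already-ingested file from the RDS local folder
-- to stay under the 100-file limit
exec msdb.dbo.rds_delete_from_filesystem
   @rds_file_path = 'D:\S3\source_data\LAPD_Calls_for_Service_2010.csv';
```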
Fourth, the maximum file size allowed for download is 50 GB.
Fifth, the S3 integration feature isn't automatically enabled in a restored RDS instance. It has to be reconfigured once an instance snapshot is restored.
Sixth, the files in the D:\S3 folder are lost when an RDS instance fails over to its standby replica node in a multi-AZ setup. The S3 integration needs to be re-enabled on the new node.
Other Functionality
There are other RDS-specific stored procedures for S3 integration:
- rds_delete_from_filesystem is used to delete files from the RDS local folder
- rds_cancel_task is used to cancel a task in the queue
- rds_upload_to_s3 is used to upload files from the RDS instance to S3
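For example, an upload call mirrors the download syntax. The sketch below uses the bucket and folder from earlier in this article; the file name is illustrative:

```sql
-- Upload a local file from the RDS instance back to the S3 bucket
-- (query_results.csv is a hypothetical file name)
exec msdb.dbo.rds_upload_to_s3
   @rds_file_path = 'D:\S3\source_data\query_results.csv',
   @s3_arn_of_file = 'arn:aws:s3:::sqlserver-s3-integration/data/query_results.csv',
   @overwrite_file = 1;
```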
Next Steps
This article showed how to use RDS SQL Server's S3 integration feature. Readers are encouraged to refer to the AWS documentation for more details. Also, it will be worthwhile to build an ETL framework based on this feature and develop:
- Data output stored procedures to upload query results from RDS to S3
- Data loading stored procedures that can run from SQL Agent jobs
- Data cleaning stored procedures to regularly clean up downloaded and ingested files. These stored procedures can use a status table to keep track of files already migrated
- A self-healing mechanism for failover in a multi-AZ scenario. With this mechanism, the RDS SQL Server instance will run a stored procedure to check the existence of a specific folder or file under D:\S3. If the file or folder doesn't exist or the process fails, it will mean a failover has happened and the new node isn't configured for S3 integration. The procedure can then record a message in the SQL Server log.
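A minimal sketch of the last idea, assuming the "source_data" folder from this article serves as the marker, and assuming a LIST_FILES_ON_DISK task has already completed (the task ID, the filepath column name, and the message text are illustrative):

```sql
-- Illustrative check: if the marker folder is missing from the file listing,
-- assume a failover happened and log a warning for the operator.
declare @task_id int = 15  -- ID of the most recent LIST_FILES_ON_DISK task

IF NOT EXISTS (
   SELECT 1
   FROM msdb.dbo.rds_fn_list_file_details(@task_id)
   WHERE filepath LIKE 'D:\S3\source_data%'
)
   RAISERROR('D:\S3\source_data missing; possible failover without S3 integration.', 10, 1) WITH LOG;
```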
About the author
Sadequl Hussain has been working with SQL Server since version 6.5 and his life as a DBA has seen him managing mission critical systems.
Source: https://www.mssqltips.com/sqlservertip/6619/rds-sql-server-data-import-from-amazon-s3/