RDS SQL Server Data Import from Amazon S3
By: Sadequl Hussain | Updated: 2020-11-04 | Comments (4) | Related: > Amazon AWS
Problem
We have a large number of data files stored in Amazon S3 that we would like to import into an RDS SQL Server database. We know we can use ETL tools for this, but is there any way to import this data without ETL?
Solution
Amazon Web Services (AWS) recently announced a new feature of its Relational Database Service (RDS) for SQL Server. This feature allows a native integration between Amazon RDS SQL Server and Amazon S3. With this integration, it's now possible to import files from an Amazon S3 bucket into a local folder of the RDS instance. Similarly, files from that folder can be exported to S3. The RDS local folder path is D:\S3\.
This integration is a neat way to import data into RDS SQL Server databases without using an ETL tool. S3 data files downloaded to the local drive can be easily ingested using the BULK INSERT command.
Test Scenario
In this article, we will see how to set up an RDS SQL Server instance for native S3 integration. Although the feature is available for both single-AZ and multi-AZ scenarios, we will use a single-AZ RDS instance. We will copy some sample data files from an S3 bucket to the RDS instance and then ingest that data into a database.
Source Data
The data files for the examples in this article are from the City of Los Angeles Open Data Portal. There are many publicly available datasets on this site, some of which are from the Los Angeles Police Department (LAPD). One such dataset is the yearly LAPD Calls for Service. These datasets contain anonymized details of service requests made to the LAPD each year. The files are in CSV format and have header rows.
The image below shows a sample of the data:
As you can see, there are seven columns in each dataset, which are more or less self-explanatory:
- Incident Number - Unique number that identifies the call
- Reporting District - Numerical representation of the place where the call originated from
- Area Occurred - Place where the incident is reported to have occurred
- Dispatch Date - Date the call was dispatched to an officer of the LAPD
- Dispatch Time - Time the call was dispatched to an officer of the LAPD
- Call Type Code - Alphanumeric code for the service request type
- Call Type Description - Short description of the service request type
S3 Source Location
We have copied the files under a folder called "data" in an S3 bucket:
Notice how the files follow a naming pattern. Each file has the prefix "LAPD_Calls_for_Service_", followed by the year, and every file has a .csv extension. Here, we have files for the years 2010 through 2019. We will use this naming pattern in a script later.
RDS SQL Server Instance
We created a single-AZ RDS SQL Server 2017 instance in the same region as the S3 bucket (us-east-1, N. Virginia):
Destination Database and Table
We created a database called "s3_integration" and a table called "lapd_service_calls" in it. The table's fields correspond to the source file columns.
-- Create destination database
CREATE DATABASE [s3_integration]
 CONTAINMENT = NONE
 ON PRIMARY
( NAME = N's3_integration', FILENAME = N'D:\rdsdbdata\DATA\s3_integration.mdf' , SIZE = 102400KB , FILEGROWTH = 131072KB )
 LOG ON
( NAME = N's3_integration_log', FILENAME = N'D:\rdsdbdata\DATA\s3_integration_log.ldf' , SIZE = 25600KB , FILEGROWTH = 32768KB )
GO

USE s3_integration
GO

-- Create destination table
CREATE TABLE lapd_service_calls
(
   incident_number varchar(20) NOT NULL,
   reporting_district varchar(20) NOT NULL,
   area_occurred varchar(20) NOT NULL,
   dispatch_date varchar(30) NOT NULL,
   dispatch_time varchar(20) NOT NULL,
   call_type_code varchar(20) NOT NULL,
   call_type_description varchar(30) NULL
)
GO
Although some of the fields in the source data are numeric (Incident Number) or date (Dispatch Date) type, we are specifying each field as varchar to keep the loading process simple.
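If typed values are needed later, the varchar columns can still be converted at query time. A small sketch follows; it assumes the MM/DD/YYYY layout implied by how the month and year are extracted later in this article, and TRY_CONVERT requires SQL Server 2012 or later:

```sql
-- Convert the text dispatch_date (MM/DD/YYYY in these files) on the fly;
-- TRY_CONVERT returns NULL instead of failing on malformed rows.
SELECT TRY_CONVERT(date, dispatch_date, 101) AS dispatch_day,
       COUNT(*) AS calls
FROM s3_integration.dbo.lapd_service_calls
GROUP BY TRY_CONVERT(date, dispatch_date, 101)
ORDER BY dispatch_day;
```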
Configuring S3 Integration
We will now configure the RDS instance for S3 integration.
Step 1: Creating IAM Policy and Role
The first step for RDS SQL Server integration with S3 is to create an IAM role with access to the source S3 bucket and folders. In the images below, we are creating an IAM policy with the necessary access privileges:
We are calling this policy "sqlserver-s3-integration". Next, nosotros are assigning S3 permissions to the policy:
Finally, nosotros are specifying the S3 resource on which these permissions will utilize. In this case, we are providing the ARN (Amazon Resources Name) of the source S3 bucket. Note how we are selecting the selection to include all objects (including folders) under this bucket:
Once the policy is created, we have to assign information technology to an AIM role. In the images below, we are creating an IAM function with the aforementioned name as the policy, and assigning the policy to it:
Step 2: Add IAM Role to RDS SQL Server Instance
The next step is to attach the IAM role to the RDS instance. This can be done from the "Connectivity and security" tab of the RDS instance's properties. Under the "Manage IAM roles" section, we can select the role just created and then select "S3_INTEGRATION" from the "features" drop-down list:
Once the change takes effect, the role has "Active" status:
Importing Data from S3
Importing the data from S3 into a database table involves two steps:
- Download the S3 files to a local folder (D:\S3\) in RDS
- Use the BULK INSERT command to import the local data files into the table
Step 1: Download S3 Files
Amazon RDS for SQL Server comes with several custom stored procedures and functions. These are located in the msdb database. The stored procedure to download files from S3 is "rds_download_from_s3". The syntax for this stored procedure is shown here:
exec msdb.dbo.rds_download_from_s3
   @s3_arn_of_file='arn:aws:s3:::<bucket_name>/<file_name>',
   @rds_file_path='D:\S3\<custom_folder_name>\<file_name>',
   @overwrite_file=1;
The first parameter specifies the ARN of an S3 file.
The second parameter sets the destination path in the RDS instance. The root path is D:\S3\, but we can specify a custom folder name under it. The folder will be created if it doesn't already exist.
The third parameter is a flag. It indicates whether an already downloaded file should be overwritten when the procedure runs again.
Please note that you cannot specify wildcards to copy multiple files. The procedure expects individual file names.
Instead of running this command once per file, we are using a small script to automate the process:
declare @year_start smallint
declare @year_end smallint
declare @filename_prefix varchar(50)
declare @filename varchar(50)
declare @filename_suffix varchar(10)
declare @s3path_root varchar(200)
declare @s3path varchar(200)
declare @local_path_root varchar(20)
declare @local_path varchar(300)
declare @sql nvarchar(1000)

set @year_start = 2010
set @year_end = 2019
set @filename_prefix = 'LAPD_Calls_for_Service_'
set @filename_suffix = '.csv'
set @s3path_root = 'arn:aws:s3:::sqlserver-s3-integration/data/'
set @local_path_root = 'D:\S3\source_data\'

-- Loop through the file names and dynamically build a query to import the files
while (@year_start <= @year_end)
begin
   set @filename = @filename_prefix + convert(varchar(4), @year_start) + @filename_suffix
   set @s3path = @s3path_root + @filename
   set @sql = 'exec msdb.dbo.rds_download_from_s3 '
   set @sql = @sql + '@s3_arn_of_file=' + '''' + @s3path + ''', '
   set @sql = @sql + '@rds_file_path=' + '''' + @local_path_root + @filename + ''', '
   set @sql = @sql + '@overwrite_file=1;'
   exec(@sql)
   set @year_start = @year_start + 1
end
This script loops through the years 2010 to 2019 and dynamically builds each source filename inside the loop. As we saw before, the source files have a common naming pattern, which makes this process easy. A dynamic SQL command is then generated with the file name to call the rds_download_from_s3 stored procedure. Note how we are using a custom folder ("source_data") to save our files.
With each iteration, a download task is added to a queue for the RDS instance. RDS for SQL Server processes the tasks in this queue sequentially. At any time, at most two tasks can be running simultaneously.
Once the script finishes running, we can run another stored procedure to list the files copied to the local folder. This stored procedure is called "rds_gather_file_details". Running the stored procedure also adds a task at the end of the queue:
exec msdb.dbo.rds_gather_file_details;
We can check the status of all the queued tasks by running the "rds_fn_task_status" function:
SELECT * FROM msdb.dbo.rds_fn_task_status(NULL, 0);
The first parameter value is always NULL. The second parameter is the task ID. In this case, we want to know the status of all tasks, so we specify 0.
The image below shows the various task statuses. Note how some download tasks have succeeded, one is in progress, and some are yet to run. Also, the last task ("LIST_FILES_ON_DISK") was added when we ran the "rds_gather_file_details" stored procedure. This task has an ID of 15.
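To check a single task rather than all of them, we can pass that task's ID as the second parameter. For example, for a task with ID 15 as in this article:

```sql
-- Status of just one specific task (ID 15 in this example)
SELECT * FROM msdb.dbo.rds_fn_task_status(NULL, 15);
```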
When all files are successfully downloaded and the "LIST_FILES_ON_DISK" task has also completed, we can run another function to see the list of locally downloaded files:
SELECT * FROM msdb.dbo.rds_fn_list_file_details(15);
The "rds_fn_list_file_details" function's parameter value is the task ID of the "LIST_FILES_ON_DISK" task. In this example, it was 15. The image below shows the result:
Step 2: BULK INSERT the Data
We can run a series of BULK INSERT commands to load the data into the table we created. The code snippet below shows the load process:
declare @year_start smallint
declare @year_end smallint
declare @local_filename_prefix varchar(50)
declare @local_filename_suffix varchar(10)
declare @local_filename varchar(100)
declare @tablename sysname
declare @sql nvarchar(1000)

set @year_start = 2010
set @year_end = 2019
set @local_filename_prefix = 'D:\S3\source_data\LAPD_Calls_for_Service_'
set @local_filename_suffix = '.csv'
set @tablename = 's3_integration.dbo.lapd_service_calls'

while (@year_start <= @year_end)
begin
   set @local_filename = @local_filename_prefix + convert(varchar(4), @year_start) + @local_filename_suffix
   set @sql = 'BULK INSERT ' + @tablename + ' FROM ''' + @local_filename + ''' WITH (DATAFILETYPE = ''char'', FIRSTROW = 2, FIELDTERMINATOR = '','', ROWTERMINATOR = ''0x0a'')'
   exec(@sql)
   set @year_start = @year_start + 1
end
Once again, we are using the year as a variable and looping through the values between 2010 and 2019 to build a dynamic SQL command. Once all the files are loaded, the output looks like this:
We can now run a query to see the total number of rows loaded:
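For instance, a simple count over the destination table:

```sql
-- Total rows loaded across all ten yearly files
SELECT COUNT(*) AS total_rows
FROM s3_integration.dbo.lapd_service_calls;
```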
We can also run some analysis on the loaded data. In the code snippet below, we are finding, for each month, the total number of calls made to the LAPD about possible occurrences of domestic violence in the Hollywood neighborhood.
SELECT
   RIGHT(dispatch_date, 4) AS Incident_Year,
   CASE LEFT(dispatch_date, 2)
      WHEN '01' THEN 'January'
      WHEN '02' THEN 'February'
      WHEN '03' THEN 'March'
      WHEN '04' THEN 'April'
      WHEN '05' THEN 'May'
      WHEN '06' THEN 'June'
      WHEN '07' THEN 'July'
      WHEN '08' THEN 'August'
      WHEN '09' THEN 'September'
      WHEN '10' THEN 'October'
      WHEN '11' THEN 'November'
      WHEN '12' THEN 'December'
      ELSE 'Unknown'
   END AS Incident_Month,
   COUNT(*) AS Domestic_Violence_Calls
FROM s3_integration.dbo.lapd_service_calls
WHERE area_occurred LIKE '%Hollywood%'
AND call_type_description LIKE '%DOM%'
GROUP BY RIGHT(dispatch_date, 4), LEFT(dispatch_date, 2)
ORDER BY RIGHT(dispatch_date, 4), LEFT(dispatch_date, 2);
The results may show some interesting trends:
A Few Points to Remember
There are some limitations to RDS' integration with S3.
First of all, both the RDS instance and the S3 bucket have to be in the same AWS region. Cross-region integration isn't supported at the time of writing (2020). Similarly, the IAM role owner and the S3 bucket owner have to be the same IAM user; in other words, the S3 bucket cannot be in another AWS account.
Secondly, only files without any extension or files with the following extensions are supported:
- *.bcp
- *.csv
- *.dat
- *.fmt
- *.info
- *.lst
- *.tbl
- *.txt
- *.xml
Third, the maximum number of files allowed for local download is 100. But this can be overcome by removing the files once they are imported into a database.
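A sketch of such a cleanup, using the folder and file names from this article (the rds_delete_from_filesystem procedure is covered under "Other Functionality" further down):

```sql
-- Delete an already-ingested file from the RDS local folder
-- to stay under the 100-file limit
exec msdb.dbo.rds_delete_from_filesystem
   @rds_file_path = 'D:\S3\source_data\LAPD_Calls_for_Service_2010.csv';
```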
Fourth, the maximum file size allowed for download is 50 GB.
Fifth, the S3 integration feature isn't automatically enabled in a restored RDS instance. It has to be reconfigured once an instance snapshot is restored.
Sixth, the files in the D:\S3 folder are lost when an RDS instance fails over to its standby replica node in a multi-AZ setup. The S3 integration needs to be re-enabled on the new node.
Other Functionality
There are other RDS-specific stored procedures for S3 integration:
- rds_delete_from_filesystem is used to delete files from the RDS local folder
- rds_cancel_task is used to cancel a task in the queue
- rds_upload_to_s3 is used to upload files from the RDS instance to S3
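For example, an upload call mirrors the download syntax. The sketch below uses the bucket and folder from earlier in this article; the file name is illustrative:

```sql
-- Upload a local file from the RDS instance back to the S3 bucket
-- (query_results.csv is a hypothetical file name)
exec msdb.dbo.rds_upload_to_s3
   @rds_file_path = 'D:\S3\source_data\query_results.csv',
   @s3_arn_of_file = 'arn:aws:s3:::sqlserver-s3-integration/data/query_results.csv',
   @overwrite_file = 1;
```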
Next Steps
This article showed how to use RDS SQL Server's S3 integration feature. Readers are encouraged to refer to the AWS documentation for more details. Also, it will be worthwhile to build an ETL framework based on this feature and develop:
- Data output stored procedures to upload query results from RDS to S3
- Data loading stored procedures that can run from SQL Agent jobs
- Data cleaning stored procedures to regularly clean up downloaded and ingested files. These stored procedures can use a status table to keep track of files already migrated
- A self-healing mechanism for failover in a multi-AZ scenario. With this mechanism, the RDS SQL Server instance will run a stored procedure to check the existence of a specific folder or file under D:\S3. If the file or folder doesn't exist or the process fails, it will mean a failover has happened and the new node isn't configured for S3 integration. The procedure can then record a message in the SQL Server log.
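A minimal sketch of the last idea, assuming the "source_data" folder from this article serves as the marker, and assuming a LIST_FILES_ON_DISK task has already completed (the task ID, the filepath column name, and the message text are illustrative):

```sql
-- Illustrative check: if the marker folder is missing from the file listing,
-- assume a failover happened and log a warning for the operator.
declare @task_id int = 15  -- ID of the most recent LIST_FILES_ON_DISK task

IF NOT EXISTS (
   SELECT 1
   FROM msdb.dbo.rds_fn_list_file_details(@task_id)
   WHERE filepath LIKE 'D:\S3\source_data%'
)
   RAISERROR('D:\S3\source_data missing; possible failover without S3 integration.', 10, 1) WITH LOG;
```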
About the author
Sadequl Hussain has been working with SQL Server since version 6.5 and his life as a DBA has seen him managing mission critical systems.
Source: https://www.mssqltips.com/sqlservertip/6619/rds-sql-server-data-import-from-amazon-s3/