jetset

Reputation: 892

AWS Glue: SQL Server multiple partitioned databases ETL into Redshift

Our team is trying to build an ETL pipeline into Redshift to serve as our data warehouse for reporting. We are using Microsoft SQL Server and have partitioned our database into 40+ data sources. We are looking for a way to pipe the data from all of these identical data sources into a single Redshift database.

Looking at AWS Glue, it doesn't seem possible to achieve this. Since Glue exposes the job script for developers to edit, I was wondering if anyone else has had experience with looping through multiple databases and transferring the same table into a single data warehouse. We are trying to avoid creating a job for each database... unless we can programmatically loop through and create a job per database.
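One way to avoid a job per database is a single Glue job script that loops over parameterised JDBC connections. A minimal sketch of that idea, assuming hypothetical tenant database names, host, and table (none of these values are from the question):

```python
# Hypothetical sketch: build per-database JDBC connection options so a single
# Glue job can loop over every partitioned SQL Server source. Host, database
# names, and table are illustrative placeholders.

def jdbc_options(host, database, table, user="etl_user"):
    """Return connection options in the shape Glue's JDBC reader expects."""
    return {
        "url": f"jdbc:sqlserver://{host}:1433;databaseName={database}",
        "dbtable": table,
        "user": user,
        # the password would come from job parameters or AWS Secrets Manager
    }

# 40+ identical partitioned databases, e.g. tenant_001 ... tenant_040
databases = [f"tenant_{i:03d}" for i in range(1, 41)]

all_options = [
    jdbc_options("sqlserver.internal.example.com", db, "dbo.orders")
    for db in databases
]

# Inside the actual Glue job script the loop body would read each source and
# append into one Redshift staging table, roughly:
#
#   for opts in all_options:
#       frame = glueContext.create_dynamic_frame.from_options(
#           connection_type="sqlserver", connection_options=opts)
#       glueContext.write_dynamic_frame.from_jdbc_conf(
#           frame=frame, catalog_connection="redshift-conn",
#           connection_options={"dbtable": "staging.orders", "database": "dw"})
```

The Glue calls are left as comments because they only run inside a Glue job environment; the point is that one script, parameterised this way, covers all 40+ sources.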

We've also taken a look at DMS, which is helpful for getting the schema and current data over to Redshift, but it doesn't seem like it would handle the multiple partitioned data sources either.

Upvotes: 0

Views: 1203

Answers (2)

langton

Reputation: 126

This sounds like an excellent use-case for Matillion ETL for Redshift.

(Full disclosure: I am the product manager for Matillion ETL for Redshift)

Matillion is an ELT tool - it will Extract data from your (numerous) SQL server databases and Load them, via an efficient Redshift COPY, into some staging tables (which can be stored inside Redshift in the usual way, or can be held on S3 and accessed from Redshift via Spectrum). From there you can add Transformation jobs to clean/filter/join (and much more!) into nice queryable star-schemas for your reporting users.

If the table schemas on your 40+ databases are very similar (your question doesn't clarify how you are breaking your data down into those servers - horizontal or vertical) you can parameterise the connection details in your jobs and use iteration to run them over each source database, either serially or with a level of parallelism.

Pushing down transformations to Redshift works nicely because all of those transformation queries can utilize the power of a massively parallel, scalable compute architecture. Workload Management configuration can be used to ensure ETL and User queries can happen concurrently.

Also, you may have other sources of data you want to mash-up inside your Redshift cluster, and Matillion supports many more - see https://www.matillion.com/etl-for-redshift/integrations/.

Upvotes: 3

Jon Scott

Reputation: 4354

You can use AWS DMS for this.

Steps:

  1. Set up and configure a DMS replication instance.
  2. Set up a target endpoint for Redshift.
  3. Set up a source endpoint for each SQL Server instance (see https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.SQLServer.html).
  4. Set up a task for each SQL Server source. You can specify the tables to copy/synchronise, and you can use a transformation rule to specify which schema name(s) in Redshift you want to write to.

You will then have all of the data in identical schemas in Redshift.
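Step 4 above can be scripted rather than clicked through. A sketch, assuming hypothetical tenant names, that generates one DMS table-mapping document per source, selecting the tables and renaming the schema so each source lands in its own Redshift schema:

```python
import json

# Hypothetical sketch: generate a DMS table-mapping JSON per SQL Server
# source. Source schema, tenant names, and table list are placeholders.

def table_mapping(source_schema, target_schema, tables=("orders",)):
    """Selection rules for the tables plus a rename to a per-source schema."""
    rules = []
    for i, table in enumerate(tables, start=1):
        rules.append({
            "rule-type": "selection",
            "rule-id": str(i),
            "rule-name": f"select-{table}",
            "object-locator": {"schema-name": source_schema,
                               "table-name": table},
            "rule-action": "include",
        })
    # transformation rule: rename the source schema on the Redshift side
    rules.append({
        "rule-type": "transformation",
        "rule-id": str(len(tables) + 1),
        "rule-name": "rename-schema",
        "rule-target": "schema",
        "object-locator": {"schema-name": source_schema},
        "rule-action": "rename",
        "value": target_schema,
    })
    return {"rules": rules}

sources = [f"tenant_{i:03d}" for i in range(1, 41)]
mappings = {src: json.dumps(table_mapping("dbo", src)) for src in sources}

# Each mapping would then go to boto3's DMS client when creating the task:
#   dms.create_replication_task(..., TableMappings=mappings[src], ...)
```

The boto3 call is left as a comment since it needs AWS credentials and the endpoint/instance ARNs from steps 1-3.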

If you want to query all of those together, you can do so by either running some transformation code inside Redshift to combine them into new tables, or you may be able to use views.
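The view approach can be generated rather than hand-written. A sketch, assuming hypothetical per-source schema names and an `orders` table, that builds a `UNION ALL` view DDL across all 40+ identical schemas:

```python
# Hypothetical sketch: build the DDL for one Redshift view that unions the
# same table from every per-source schema. Schema/table names are placeholders.

def union_view_ddl(view_name, schemas, table):
    """CREATE VIEW statement unioning <schema>.<table> across all schemas,
    tagging each row with the schema it came from."""
    selects = "\nUNION ALL\n".join(
        f"SELECT '{s}' AS source_db, * FROM {s}.{table}" for s in schemas
    )
    return f"CREATE OR REPLACE VIEW {view_name} AS\n{selects};"

schemas = [f"tenant_{i:03d}" for i in range(1, 41)]
ddl = union_view_ddl("reporting.all_orders", schemas, "orders")
# `ddl` would then be executed against Redshift with any SQL client.
```

Adding the `source_db` column keeps track of which partition each row came from, which is usually worth having once everything is merged.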

Upvotes: 0
