Reputation: 2633
I'm trying to decide whether to use AWS Glue or Amazon Data Pipeline for our ETL. I need to incrementally copy several tables to Redshift. Almost all tables need to be copied with no transformation. One table requires a transformation that could be done using Spark.
Based on my understanding of these two services, the best solution is to use a combination of the two. Data Pipeline can copy everything to S3. From there, if no transformation is needed, Data Pipeline can use Redshift COPY to move the data into Redshift. Where a transformation is required, a Glue job can apply it and copy the data to Redshift.
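For the tables that need no transformation, the load step I have in mind is just a plain Redshift COPY from the S3 staging location. A minimal sketch of that statement (run here via psycopg2 purely for illustration; the cluster, bucket, table, and IAM role names are placeholders):

```python
import psycopg2

# Placeholder connection details -- substitute your own cluster endpoint and credentials.
conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="admin", password="...")

copy_sql = """
    COPY public.my_table
    FROM 's3://my-staging-bucket/exports/my_table/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    FORMAT AS CSV
    IGNOREHEADER 1;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # Redshift pulls the staged files from S3 in parallel
```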
Is this a sensible strategy or am I misunderstanding the applications of these services?
Upvotes: 1
Views: 1970
Reputation: 1978
I'm guessing it's long past the project deadline, but for people looking at this:
Use only AWS Glue. You can define Redshift as both a source and a target connection, meaning that you can read from it and dump into it. Before you do that, however, you'll need to use a Crawler to create the Glue-specific schema.
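As a rough sketch of what such a Glue job script can look like (the catalog database, table, connection name, temp dir, and column mappings below are all placeholders, not something specific to your setup):

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table the Crawler registered in the Glue Data Catalog
# (database/table names are placeholders).
source = glue_context.create_dynamic_frame.from_catalog(
    database="my_catalog_db", table_name="my_table")

# Example transformation: rename and cast a couple of columns.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("id", "string", "id", "int"),
              ("created", "string", "created_at", "timestamp")])

# Write the result into Redshift through a Glue connection
# ("my-redshift-connection" must be defined in the Glue console).
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="my-redshift-connection",
    connection_options={"dbtable": "public.my_table", "database": "dev"},
    redshift_tmp_dir=args["TempDir"])

job.commit()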
All of this can also be done with Data Pipeline alone using SqlActivity(s), although setting everything up might take significantly longer and won't be that much cheaper.
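For reference, a bare-bones pipeline definition using a SqlActivity would look roughly like the sketch below (expressed as a Python dict dumped to JSON for `aws datapipeline put-pipeline-definition`; the cluster, credentials, bucket, and role names are all placeholders):

```python
import json

# Rough pipeline definition: a SqlActivity that runs a Redshift COPY on a small EC2 resource.
definition = {
    "objects": [
        {"id": "Default", "name": "Default",
         "scheduleType": "ondemand",
         "role": "DataPipelineDefaultRole",
         "resourceRole": "DataPipelineDefaultResourceRole"},
        {"id": "MyRedshift", "type": "RedshiftDatabase",
         "clusterId": "my-cluster", "databaseName": "dev",
         "username": "admin", "*password": "..."},
        {"id": "MyEc2", "type": "Ec2Resource",
         "instanceType": "t2.micro", "terminateAfter": "1 Hour"},
        {"id": "LoadMyTable", "type": "SqlActivity",
         "database": {"ref": "MyRedshift"},
         "runsOn": {"ref": "MyEc2"},
         "script": ("COPY public.my_table "
                    "FROM 's3://my-staging-bucket/exports/my_table/' "
                    "IAM_ROLE 'arn:aws:iam::123456789012:role/MyCopyRole' "
                    "FORMAT AS CSV;")},
    ]
}

with open("pipeline-definition.json", "w") as f:
    json.dump(definition, f, indent=2)
```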
rant: I'm honestly surprised that AWS has focused solely on big data solutions without providing a decent tool for small/medium/large data sets. Glue is overkill and Data Pipeline is cumbersome/terrible to use. There should be a simple SQL-type Lambda!
Upvotes: 2