Reputation: 573
I'm working on an ETL pipeline using Apache NiFi. The flow runs hourly and looks something like this:
I found Glue to be expensive for sending data to Redshift, so I will code something ad hoc that uses the COPY command.
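Roughly what I have in mind for that ad-hoc COPY step is something like the sketch below (the cluster endpoint, table, bucket path, and IAM role are placeholders, not my real values):

```python
# Minimal sketch of the ad-hoc COPY step: run a Redshift COPY from S3 via psycopg2.
# All connection details, the table, the S3 path, and the IAM role are placeholders.
import psycopg2

COPY_SQL = """
    COPY analytics.events_hourly
    FROM 's3://my-curated-bucket/hourly_aggregates/dt=2020-01-01/hour=00/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS PARQUET;
"""

conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="...",
)
conn.autocommit = True  # COPY is committed as soon as it finishes
with conn.cursor() as cur:
    cur.execute(COPY_SQL)
conn.close()
```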
The question I would like to ask is whether you can point me to a better/cheaper/more scalable tool or approach, especially for steps 2 and 3.
I'm looking for ways to optimize this process and make it ready to receive millions of records per hour.
Thank you!
Upvotes: 0
Views: 710
Reputation: 67
Read about AWS Glue: https://aws.amazon.com/glue/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console.
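If you would rather automate it than click through the console, a rough sketch of creating and starting a Glue job from code with boto3 could look like this (the job name, IAM role, and script location are placeholder values):

```python
# Sketch: create a Glue ETL job and kick off a run programmatically with boto3.
# Job name, role ARN, and script location are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="hourly-etl",
    Role="arn:aws:iam::123456789012:role/glue-etl-role",
    Command={
        "Name": "glueetl",  # Spark ETL job type
        "ScriptLocation": "s3://my-scripts-bucket/jobs/hourly_etl.py",
        "PythonVersion": "3",
    },
)

glue.start_job_run(JobName="hourly-etl")
```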
Upvotes: 0
Reputation: 5536
You can save your data in a partitioned fashion in S3, then use Glue Spark jobs to transform it, implementing the joins and aggregations there; that will be fast if written in an optimized way.
This will also save you cost, since Glue will process partitioned data faster than you might expect, and to move the data into Redshift the COPY command is the best approach.
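As a rough sketch, assuming the raw data lands in S3 partitioned by date and hour (bucket names, paths, and column names below are placeholders), a Glue Spark job could look something like this:

```python
# Sketch of a Glue Spark job: read partitioned raw data from S3, aggregate it,
# and write it back partitioned so the Redshift COPY only has to pick up new partitions.
# Buckets, paths, and columns are placeholders.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Raw data assumed to be laid out as .../events/dt=YYYY-MM-DD/hour=HH/
raw = spark.read.parquet("s3://my-raw-bucket/events/")

# Example aggregation per hour and customer.
hourly = (
    raw.groupBy("dt", "hour", "customer_id")
       .agg(F.count("*").alias("event_count"),
            F.sum("amount").alias("total_amount"))
)

# Write back partitioned; each hourly COPY then targets a single partition prefix.
(hourly.write
       .mode("overwrite")
       .partitionBy("dt", "hour")
       .parquet("s3://my-curated-bucket/hourly_aggregates/"))

job.commit()
```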
Upvotes: 1
Reputation: 41
Interesting workflow.
You can actually use some neat combinations to automatically get data from S3 into Redshift.
You can do S3 (raw data) -> Lambda (triggered off the PUT notification) -> Kinesis Firehose -> S3 (batched & transformed with a Firehose transformer) -> Firehose COPY into Redshift.
This flow will completely automate updates based on your data. You can read more about it here. Hope this helps.
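A rough sketch of the Lambda piece, assuming newline-delimited JSON objects and a Firehose delivery stream already configured to COPY into Redshift (the stream name and event parsing are placeholders):

```python
# Sketch: Lambda triggered by the S3 PUT notification forwards the new object's
# records to a Kinesis Firehose delivery stream, which batches, optionally
# transforms, and COPYs them into Redshift. Stream name is a placeholder.
import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

DELIVERY_STREAM = "my-redshift-delivery-stream"

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

        # Assumes newline-delimited records; Firehose accepts up to 500 per batch.
        lines = [line for line in body.splitlines() if line]
        for i in range(0, len(lines), 500):
            firehose.put_record_batch(
                DeliveryStreamName=DELIVERY_STREAM,
                Records=[{"Data": line + b"\n"} for line in lines[i:i + 500]],
            )
```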
Upvotes: 1