Reputation: 573
I'm working on an ETL pipeline using Apache NiFi. The flow runs hourly and looks something like this:
I found Glue to be expensive for sending data to Redshift, so I will code something ad hoc that uses the COPY command.
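Roughly what I have in mind for that ad-hoc COPY step is something like the sketch below (the cluster endpoint, table, bucket path, and IAM role are placeholders, not my real values):

```python
# Minimal sketch of the ad-hoc COPY step: run a Redshift COPY from S3 via psycopg2.
# All connection details, the table, the S3 path, and the IAM role are placeholders.
import psycopg2

COPY_SQL = """
    COPY analytics.events_hourly
    FROM 's3://my-curated-bucket/hourly_aggregates/dt=2020-01-01/hour=00/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS PARQUET;
"""

conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="...",
)
conn.autocommit = True  # COPY is committed as soon as it finishes
with conn.cursor() as cur:
    cur.execute(COPY_SQL)
conn.close()
```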
The question I would like to ask is whether you can point me to a better/cheaper/more scalable tool or approach, especially for steps 2 and 3.
I'm looking for ways to optimize this process and make it ready to receive millions of records per hour.
Thank you!
Upvotes: 0
Views: 710
Reputation: 67
Read about AWS Glue: https://aws.amazon.com/glue/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console.
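If you would rather automate it than click through the console, a rough sketch of creating and starting a Glue job from code with boto3 could look like this (the job name, IAM role, and script location are placeholder values):

```python
# Sketch: create a Glue ETL job and kick off a run programmatically with boto3.
# Job name, role ARN, and script location are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="hourly-etl",
    Role="arn:aws:iam::123456789012:role/glue-etl-role",
    Command={
        "Name": "glueetl",  # Spark ETL job type
        "ScriptLocation": "s3://my-scripts-bucket/jobs/hourly_etl.py",
        "PythonVersion": "3",
    },
)

glue.start_job_run(JobName="hourly-etl")
```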
Upvotes: 0
Reputation: 5536
You can save your data in a partitioned fashion in S3, then use Glue Spark jobs to transform it, implementing the joins and aggregations there; that will be fast if written in an optimized way.
This will also save you cost, since Glue will process partitioned data faster than you might expect, and to move the data into Redshift the COPY command is the best approach.
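As a rough sketch, assuming the raw data lands in S3 partitioned by date and hour (bucket names, paths, and column names below are placeholders), a Glue Spark job could look something like this:

```python
# Sketch of a Glue Spark job: read partitioned raw data from S3, aggregate it,
# and write it back partitioned so the Redshift COPY only has to pick up new partitions.
# Buckets, paths, and columns are placeholders.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Raw data assumed to be laid out as .../events/dt=YYYY-MM-DD/hour=HH/
raw = spark.read.parquet("s3://my-raw-bucket/events/")

# Example aggregation per hour and customer.
hourly = (
    raw.groupBy("dt", "hour", "customer_id")
       .agg(F.count("*").alias("event_count"),
            F.sum("amount").alias("total_amount"))
)

# Write back partitioned; each hourly COPY then targets a single partition prefix.
(hourly.write
       .mode("overwrite")
       .partitionBy("dt", "hour")
       .parquet("s3://my-curated-bucket/hourly_aggregates/"))

job.commit()
```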
Upvotes: 1
Reputation: 41
Interesting workflow.
You can actually use some neat combinations to automatically get data from S3 into Redshift.
You can do S3 (raw data) -> Lambda (triggered off the PUT notification) -> Kinesis Firehose -> S3 (batched & transformed with a Firehose transformer) -> Firehose COPY into Redshift.
This flow will completely automate updates based on your data. You can read more about it here. Hope this helps.
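A rough sketch of the Lambda piece, assuming newline-delimited JSON objects and a Firehose delivery stream already configured to COPY into Redshift (the stream name and event parsing are placeholders):

```python
# Sketch: Lambda triggered by the S3 PUT notification forwards the new object's
# records to a Kinesis Firehose delivery stream, which batches, optionally
# transforms, and COPYs them into Redshift. Stream name is a placeholder.
import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

DELIVERY_STREAM = "my-redshift-delivery-stream"

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

        # Assumes newline-delimited records; Firehose accepts up to 500 per batch.
        lines = [line for line in body.splitlines() if line]
        for i in range(0, len(lines), 500):
            firehose.put_record_batch(
                DeliveryStreamName=DELIVERY_STREAM,
                Records=[{"Data": line + b"\n"} for line in lines[i:i + 500]],
            )
```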
Upvotes: 1