Nobody

Reputation: 1

What is the best way to copy large CSV files from S3 to Redshift?

I'm working on a task of copying CSV files from an S3 bucket to Redshift. I've found multiple ways to do this, but I'm not sure which one is best. Here's the scenario:

At regular intervals, multiple CSV files of around 500 MB - 1 GB each will be added to my S3 bucket. The data can contain duplicates. The task is to copy the data to a Redshift table while ensuring that no duplicate data ends up in Redshift.

Here are the ways I found which can be used:

  1. Create an AWS Lambda function that is triggered whenever a file is added to the S3 bucket.

  2. Use AWS Kinesis

  3. Use AWS Glue

I understand Lambda should not be used for jobs that take more than 5 minutes. So should I use it, or just eliminate this option?

Kinesis can handle large amounts of data, but is it the best way to do this?

I'm not familiar with Glue or Kinesis, but I've read that Glue can be slow.

If anyone can point me in the right direction, it would be really helpful.

Upvotes: 0

Views: 1413

Answers (1)

benfarr

Reputation: 166

You can definitely make this work with Lambda if you leverage Step Functions and the S3 Select option to filter subsets of the data into smaller chunks. You'd have Step Functions manage your ETL orchestration: a pre-process state (see the links below) could determine the execution requirements and then execute multiple Lambdas, even in parallel if you wish, each selectively pulling a subset of the large data file via S3 Select. Those Lambdas would process their subsets to remove duplicates and perform any other ETL operations you might require. Then you'd take the processed data and write it to Redshift. Here's a rough sketch of what one of those worker Lambdas could look like, followed by links that will help you put the architecture together:
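This is only a sketch, not production code: the event shape (bucket, key, id range) is whatever your state machine passes in, and the `id` column used for de-duplication is an assumed name you'd swap for your own key.

```python
# Sketch of a worker Lambda: pull a slice of a large CSV with S3 Select,
# drop in-chunk duplicates, and stage the cleaned rows back to S3.
# Assumptions: the state machine passes bucket/key/range, and the first
# CSV column is an "id" that identifies duplicates.
import csv
import io

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    bucket = event["bucket"]           # assumed input shape from Step Functions
    key = event["key"]
    start_id, end_id = event["range"]  # hypothetical partitioning by id

    # S3 Select scans only the matching rows instead of the whole 1 GB object.
    response = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=(
            "SELECT * FROM s3object s "
            f"WHERE CAST(s.id AS INT) BETWEEN {start_id} AND {end_id}"
        ),
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"CSV": {}},
    )

    # Buffer the streamed result first, since a record can span payload chunks.
    buffer = io.StringIO()
    for stream_event in response["Payload"]:
        if "Records" in stream_event:
            buffer.write(stream_event["Records"]["Payload"].decode("utf-8"))
    buffer.seek(0)

    # Keep only the first occurrence of each id within this chunk.
    seen = set()
    rows = []
    for row in csv.reader(buffer):
        if row and row[0] not in seen:  # assuming the first column is the id
            seen.add(row[0])
            rows.append(row)

    # Write the cleaned chunk somewhere the Redshift load step can COPY from.
    out_key = f"clean/{key}-{start_id}-{end_id}.csv"
    out = io.StringIO()
    csv.writer(out).writerows(rows)
    s3.put_object(Bucket=bucket, Key=out_key, Body=out.getvalue().encode("utf-8"))
    return {"cleanKey": out_key, "rowCount": len(rows)}
```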

Trigger State Machine Execution from S3 Event

Manage Lambda Processing Executions and workflow state

Use S3 Select to pull subsets from large data objects
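To go with the first link above, the S3-triggered Lambda that kicks off the state machine can stay tiny; this is just a sketch, and the state machine ARN (read from an environment variable here) is a placeholder.

```python
# Sketch: start one Step Functions execution per CSV object uploaded to S3.
import json
import os

import boto3

sfn = boto3.client("stepfunctions")


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        sfn.start_execution(
            stateMachineArn=os.environ["STATE_MACHINE_ARN"],  # placeholder
            input=json.dumps({"bucket": bucket, "key": key}),
        )
```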

Also, here's a link to a Python ETL pipeline example for the CDK that I built. You'll see an example of an S3 event-driven Lambda along with data processing and DDB or MySQL writes. It will give you an idea of how you can build out comprehensive Lambdas for ETL operations. You would just need to add a psycopg2 layer to your deployment to talk to Redshift. Hope this helps.
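For the final write to Redshift, one common pattern (sketched below; the table names, the `id` key column, the bucket, and the IAM role ARN are all placeholders) is to COPY the cleaned file into a staging table and insert only the rows that aren't already in the target table, with psycopg2 coming from that Lambda layer.

```python
# Sketch: load a cleaned CSV from S3 into Redshift without introducing duplicates.
import os

import psycopg2


def load_to_redshift(clean_key):
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=5439,
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    with conn, conn.cursor() as cur:
        # Staging table with the same columns as the target table.
        cur.execute("CREATE TEMP TABLE staging (LIKE my_table);")
        # COPY the cleaned chunk; add IGNOREHEADER 1 if your file keeps a header row.
        cur.execute(
            f"""
            COPY staging
            FROM 's3://my-bucket/{clean_key}'
            IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
            CSV;
            """
        )
        # Insert only ids that are not already present in the target table.
        cur.execute(
            """
            INSERT INTO my_table
            SELECT s.*
            FROM staging s
            LEFT JOIN my_table t ON s.id = t.id
            WHERE t.id IS NULL;
            """
        )
    conn.close()
```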

Upvotes: 1
