Reputation: 177
Looking for some tips here. I've done quite a bit of coding and research using Python 3 and Lambda, but timeouts are the biggest issue I'm struggling with at the moment. I am trying to read a very large CSV file (3 GB) from S3 and push the rows into DynamoDB. I'm currently reading about 1024 * 32 bytes at a time and then pushing the rows into DynamoDB (batch write with asyncio) using a pub/sub pattern. It works great for small files, i.e. ~500K rows, but it times out when I have millions of rows. I'm trying NOT to use AWS Glue and/or EMR; I have some constraints/limitations with those.
Does anyone know if this can be done using Lambda or Step Functions? If so, could you please share your ideas? Thanks!!
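For context, here is a minimal sketch of the kind of streaming read + batch write described above, stripped of the asyncio/pub-sub layer. It assumes boto3; the bucket, key, and table names are placeholders:

```python
import csv
import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-table")  # placeholder table name

def stream_csv_to_dynamo(bucket: str, key: str) -> None:
    # Stream the object body instead of downloading the whole file,
    # so memory/disk usage stays flat regardless of file size.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    lines = (line.decode("utf-8") for line in body.iter_lines())
    reader = csv.DictReader(lines)

    # batch_writer buffers items and issues BatchWriteItem calls
    # of up to 25 items, handling retries of unprocessed items.
    with table.batch_writer() as batch:
        for row in reader:
            batch.put_item(Item=row)

# stream_csv_to_dynamo("my-bucket", "data/huge.csv")
```

Even with streaming, a single Lambda invocation is still capped at 15 minutes, which is the timeout being hit here.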
Upvotes: 0
Views: 539
Reputation: 698
Besides the Lambda time constraint, you might also run into a storage constraint while reading the file in AWS Lambda: Lambda only gives you 512 MB of /tmp directory storage, and whether that bites depends on how you are reading the file in Lambda.
If you don't want to go via AWS Glue or EMR, another option is to provision an EC2 instance and run the same code you are running in Lambda from there. To make it cost effective, you can make the EC2 instance transient, i.e. provision it when you need to run the S3-to-DynamoDB job and shut it down once the job is completed. This transient lifecycle can be driven by a Lambda function, or you can orchestrate it with Step Functions; a sketch follows below. Another option you can look into is AWS Data Pipeline.
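As a rough sketch of the transient-EC2 idea, a Lambda function can launch an instance whose user data runs the loader script and then shuts the instance down. The AMI ID, instance type, instance profile, and script location below are all placeholders for illustration:

```python
import boto3

ec2 = boto3.client("ec2")

# User data: fetch the loader script, run it, then power off.
USER_DATA = """#!/bin/bash
aws s3 cp s3://my-bucket/load_csv_to_dynamo.py /tmp/job.py  # placeholder script location
python3 /tmp/job.py
shutdown -h now
"""

def lambda_handler(event, context):
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",          # placeholder AMI
        InstanceType="m5.large",                  # placeholder instance type
        MinCount=1,
        MaxCount=1,
        IamInstanceProfile={"Name": "s3-dynamo-loader-role"},  # placeholder role
        # "terminate" makes the shutdown in user data delete the instance entirely
        InstanceInitiatedShutdownBehavior="terminate",
        UserData=USER_DATA,
    )
    return {"instanceId": response["Instances"][0]["InstanceId"]}
```

The same launch/terminate steps could equally be modeled as Step Functions states if you want retries and status tracking handled for you.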
Upvotes: 2