Better/best approach to load a huge CSV file into DynamoDB

Question

I have a huge .csv file on my local machine. I want to load that data into a DynamoDB table (eu-west-1, Ireland). How would you do that?

My first approach was:
- Iterate over the CSV file locally
- Send each row to AWS via curl -X POST -d '' .../connector/mydata
- Handle each call in a Lambda function and write the row to DynamoDB

I do not like that solution because:
- There are too many requests
- If I send the data without the CSV header information, I have to hardcode the column names in the Lambda function
- If I send the data with the CSV header, there is too much traffic

I was also considering putting the file in an S3 bucket and processing it with a Lambda function, but the file is huge and Lambda's memory and execution-time limits worry me.

I am also considering doing the job on an EC2 machine, but then I either lose responsiveness (if I turn the machine off while it is not in use) or lose money (if I leave it running).

I was told that Kinesis may be a solution, but I am not convinced.

Please tell me what the best approach would be to get the huge CSV file into DynamoDB if you were me. I want to minimise the workload for a "second" upload.

I prefer using Node.js or R. Python may be acceptable as a last resort.

E.J. Brennan · Accepted Answer

If you want to do it the AWS way, then AWS Data Pipeline may be the best approach.

Here is a tutorial that does a bit more than you need, but should get you started:

The first part of this tutorial explains how to define an AWS Data Pipeline pipeline to retrieve data from a tab-delimited file in Amazon S3 to populate a DynamoDB table, use a Hive script to define the necessary data transformation steps, and automatically create an Amazon EMR cluster to perform the work.

http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb-part1.html
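For comparison, if the one-request-per-row approach from the question is the main pain point, a local script can stream the file and let the SDK batch the writes. The following is only a minimal sketch in Python (the question prefers Node.js, so treat it as an illustration of the idea rather than a recommendation). It assumes boto3 is installed, AWS credentials are configured, the table already exists, and the CSV has a column for the table's key attribute. The names `read_rows` and `load_csv` and the table name in the usage note are hypothetical.

```python
import csv
from typing import Dict, Iterable, Iterator


def read_rows(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per CSV row, keyed by the header line.

    Rows are produced lazily, so the whole file is never held in memory.
    Empty cells and surplus unnamed columns are skipped.
    """
    for row in csv.DictReader(lines):
        yield {k: v for k, v in row.items() if k and v}


def load_csv(path: str, table_name: str, region: str = "eu-west-1") -> int:
    """Stream a CSV file into an existing DynamoDB table; return rows written.

    Assumption: every value is stored as a string, because the csv module
    does not infer types.
    """
    import boto3  # imported lazily so read_rows works without AWS dependencies

    table = boto3.resource("dynamodb", region_name=region).Table(table_name)
    written = 0
    # batch_writer buffers put requests, sends them in batches, and
    # resends any unprocessed items.
    with open(path, newline="") as f, table.batch_writer() as writer:
        for row in read_rows(f):
            writer.put_item(Item=row)
            written += 1
    return written
```

A call such as `load_csv("mydata.csv", "mydata")` could then be rerun unchanged for a second upload. Because the header line supplies the attribute names once, nothing about the columns needs to be hardcoded, and the write throughput is still bounded by the table's provisioned or on-demand capacity.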
