cameck

Reputation: 2098

AWS Data Pipeline, Best way to Structure Data in S3 for DynamoDB Mass Import?

I'm looking at migrating a massive database to Amazon's DynamoDB (think 150 million plus records).
I'm currently storing these records in Elasticsearch.

I'm reading up on Data Pipeline and you can import into DynamoDB from S3 using a TSV, CSV or JSON file.

It seems the best way to go is a JSON file and I've found two examples of how it should be structured:

So, my questions are the following:

I want to get this right the first time and not incur extra charges, since apparently you get charged whether your setup is right or wrong.

Any specific parts/links to the manual that I missed would also be greatly appreciated.

Upvotes: 2

Views: 1125

Answers (3)

Garet Jax

Reputation: 1171

I am doing this exact thing right now. In fact, I extracted 340 million rows using Data Pipeline, transformed them with Lambda, and am importing them right now with another pipeline.

A couple of things:

1) JSON is a good way to go.

2) On the export, AWS limits each file to 100,000 records. Not sure if this is required or just a design decision.

3) In order to use the pipeline for import, there is a requirement to have a manifest file. This was news to me. I had an example from my export, which you won't have; without it your import probably won't work. Its structure is:

{"name":"DynamoDB-export","version":3,
"entries": [
{"url":"s3://[BUCKET_NAME]/2019-03-06-20-17-23/dd3906a0-a548-453f-96d7-ee492e396100-transformed","mandatory":true},
...
]}
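If you need to build that manifest yourself, a rough sketch along these lines works (Python with boto3; the bucket and prefix are placeholders, not from my setup):

import json
import boto3  # assumes boto3 is installed and credentials are configured

BUCKET = "my-import-bucket"          # placeholder
PREFIX = "dynamodb-import/2019-03/"  # placeholder

s3 = boto3.client("s3")

def build_manifest(bucket, prefix):
    # List the data files under the prefix and wrap them in the
    # manifest structure shown above.
    paginator = s3.get_paginator("list_objects_v2")
    entries = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("manifest"):  # don't list the manifest itself
                continue
            entries.append({"url": f"s3://{bucket}/{key}", "mandatory": True})
    return {"name": "DynamoDB-export", "version": 3, "entries": entries}

manifest = build_manifest(BUCKET, PREFIX)
s3.put_object(Bucket=BUCKET, Key=PREFIX + "manifest", Body=json.dumps(manifest))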

4) Calorious' Blog has the format correct. I am not sure if the "S" needs to be lower case - mine all are. Here is an example row from my import file:

{"x_rotationRate":{"s":"-7.05723"},"x_acceleration":{"s":"-0.40001"},"altitude":{"s":"0.5900"},"z_rotationRate":{"s":"1.66556"},"time_stamp":{"n":"1532710597553"},"z_acceleration":{"s":"0.42711"},"y_rotationRate":{"s":"-0.58688"},"latitude":{"s":"37.3782895682606"},"x_quaternion":{"s":"-0.58124"},"x_user_accel":{"s":"0.23021"},"pressure":{"s":"101.0524"},"z_user_accel":{"s":"0.02382"},"cons_key":{"s":"index"},"z_quaternion":{"s":"-0.48528"},"heading_angle":{"s":"-1.000"},"y_user_accel":{"s":"-0.14591"},"w_quaternion":{"s":"0.65133"},"y_quaternion":{"s":"-0.04934"},"rotation_angle":{"s":"221.53970"},"longitude":{"s":"-122.080872377186"}}

Upvotes: 1

Mike Dinescu

Reputation: 55760

Based on my experience I recommend JSON as the most reliable format, assuming of course that the JSON blobs you generate are properly formatted JSON objects (i.e. proper escaping).

If you can generate valid JSON then go that route!
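For example (Python, purely to illustrate the escaping point), serializing with a JSON library rather than concatenating strings takes care of quotes, backslashes, and newlines for you:

import json

# Quotes, backslashes, and newlines inside values are escaped correctly
# when a JSON library does the serialization.
record = {"note": {"s": 'He said "hello"\nand left a \\ backslash'}}
line = json.dumps(record)
assert json.loads(line) == record  # round-trips cleanly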

Upvotes: 0

Tolbahady

Reputation: 591

I would enter a few rows manually and export them using Data Pipeline to see the exact format it generates; that is the same format you will need to follow for imports (I think it's the first format in your examples).

Then I would set up a file with a few rows (100 maybe) and run Data Pipeline to ensure it works fine.

Breaking your file into chunks sounds good to me, and it might help you recover from a failure without having to start all over again.

Make sure you don't have keys with empty, null, or undefined values; that will break and stop the import completely. When you export entries from your current database, either omit keys with no values or set a default non-empty value for them.
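For example, a small cleanup pass like this (Python; the handling is just a sketch) before you write the import file:

def clean_item(item, default=None):
    # Drop keys whose values are empty/None, or replace them with a
    # non-empty default (e.g. "N/A") if one is given.
    cleaned = {}
    for key, value in item.items():
        if value in (None, ""):
            if default is not None:
                cleaned[key] = default
            # otherwise omit the key entirely
        else:
            cleaned[key] = value
    return cleaned

# clean_item({"name": "foo", "notes": ""})        -> {"name": "foo"}
# clean_item({"name": "foo", "notes": ""}, "N/A") -> {"name": "foo", "notes": "N/A"}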

Upvotes: 0
