cameck

Reputation: 2098

AWS Data Pipeline, Best way to Structure Data in S3 for DynamoDB Mass Import?

I'm looking at migrating a massive database to Amazon's DynamoDB (think 150 million plus records).
I'm currently storing these records in Elasticsearch.

I'm reading up on Data Pipeline and you can import into DynamoDB from S3 using a TSV, CSV or JSON file.

It seems the best way to go is a JSON file and I've found two examples of how it should be structured:

So, my questions are the following:

I want to get this right the first time and not incur extra charges, since apparently you get charged whether your setup is right or wrong.

Any specific parts/links to the manual that I missed would also be greatly appreciated.

Upvotes: 2

Views: 1125

Answers (3)

Garet Jax

Reputation: 1171

I am doing this exact thing right now. In fact, I extracted 340 million rows using Data Pipeline, transformed them with Lambda, and am importing them right now with another pipeline.

A couple of things:

1) JSON is a good way to go.

2) On the export, AWS limits each file to 100,000 records. Not sure if this is required or just a design decision.

3) In order to use the pipeline for import, there is a requirement to have a manifest file. This was news to me. I had an example from my export, which you won't have; without it your import probably won't work. Its structure is:

{"name":"DynamoDB-export","version":3,
"entries": [
{"url":"s3://[BUCKET_NAME]/2019-03-06-20-17-23/dd3906a0-a548-453f-96d7-ee492e396100-transformed","mandatory":true},
...
]}
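If you need to build that manifest yourself, a rough sketch along these lines works (Python with boto3; the bucket and prefix are placeholders, not from my setup):

import json
import boto3  # assumes boto3 is installed and credentials are configured

BUCKET = "my-import-bucket"          # placeholder
PREFIX = "dynamodb-import/2019-03/"  # placeholder

s3 = boto3.client("s3")

def build_manifest(bucket, prefix):
    # List the data files under the prefix and wrap them in the
    # manifest structure shown above.
    paginator = s3.get_paginator("list_objects_v2")
    entries = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("manifest"):  # don't list the manifest itself
                continue
            entries.append({"url": f"s3://{bucket}/{key}", "mandatory": True})
    return {"name": "DynamoDB-export", "version": 3, "entries": entries}

manifest = build_manifest(BUCKET, PREFIX)
s3.put_object(Bucket=BUCKET, Key=PREFIX + "manifest", Body=json.dumps(manifest))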

4) Calorious' Blog has the format correct. I am not sure if the "S" needs to be lower case - mine all are. Here is an example row from my import file:

{"x_rotationRate":{"s":"-7.05723"},"x_acceleration":{"s":"-0.40001"},"altitude":{"s":"0.5900"},"z_rotationRate":{"s":"1.66556"},"time_stamp":{"n":"1532710597553"},"z_acceleration":{"s":"0.42711"},"y_rotationRate":{"s":"-0.58688"},"latitude":{"s":"37.3782895682606"},"x_quaternion":{"s":"-0.58124"},"x_user_accel":{"s":"0.23021"},"pressure":{"s":"101.0524"},"z_user_accel":{"s":"0.02382"},"cons_key":{"s":"index"},"z_quaternion":{"s":"-0.48528"},"heading_angle":{"s":"-1.000"},"y_user_accel":{"s":"-0.14591"},"w_quaternion":{"s":"0.65133"},"y_quaternion":{"s":"-0.04934"},"rotation_angle":{"s":"221.53970"},"longitude":{"s":"-122.080872377186"}}

Upvotes: 1

Mike Dinescu

Reputation: 55760

Based on my experience I recommend JSON as the most reliable format, assuming of course that the JSON blobs you generate are properly formatted JSON objects (i.e. proper escaping).

If you can generate valid JSON then go that route!
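For example (Python, purely to illustrate the escaping point), serializing with a JSON library rather than concatenating strings takes care of quotes, backslashes, and newlines for you:

import json

# Quotes, backslashes, and newlines inside values are escaped correctly
# when a JSON library does the serialization.
record = {"note": {"s": 'He said "hello"\nand left a \\ backslash'}}
line = json.dumps(record)
assert json.loads(line) == record  # round-trips cleanly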

Upvotes: 0

Tolbahady

Reputation: 591

I would enter a few rows manually and export them using Data Pipeline to see the exact format it generates; that is the same format you will need to follow for imports (I think it's the first format in your examples).

Then I would set up a file with a few rows (100 maybe) and run Data Pipeline to ensure it works fine.

Breaking your file into chunks sounds good to me, and it might help you recover from a failure without having to start all over again.

Make sure you don't have keys with empty, null, or undefined values; that will break and stop the import completely. When you export entries from your current database, either omit keys with no values or set a default non-empty value for them.
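For example, a small cleanup pass like this (Python; the handling is just a sketch) before you write the import file:

def clean_item(item, default=None):
    # Drop keys whose values are empty/None, or replace them with a
    # non-empty default (e.g. "N/A") if one is given.
    cleaned = {}
    for key, value in item.items():
        if value in (None, ""):
            if default is not None:
                cleaned[key] = default
            # otherwise omit the key entirely
        else:
            cleaned[key] = value
    return cleaned

# clean_item({"name": "foo", "notes": ""})        -> {"name": "foo"}
# clean_item({"name": "foo", "notes": ""}, "N/A") -> {"name": "foo", "notes": "N/A"}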

Upvotes: 0
