Reputation: 1245
I am using AWS Glue jobs to back up DynamoDB tables to S3 in Parquet format so that the data can be queried in Athena.
If I want to use these Parquet files in S3 to restore the table in DynamoDB, this is what I am thinking: read each Parquet file, convert it to JSON, and then insert the JSON-formatted data into DynamoDB (using PySpark along the lines below).
# set up the SQL context and convert the Parquet backup to JSON
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)  # sc is the existing SparkContext
parquetFile = sqlContext.read.parquet(input_file)
parquetFile.write.json(output_path)
Then convert the plain JSON into the DynamoDB-expected JSON format using https://github.com/Alonreznik/dynamodb-json before inserting it.
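Here is a minimal sketch of that restore path, skipping the intermediate JSON files and converting rows in memory instead. The bucket path and table name are hypothetical, the table is assumed small enough to collect on the driver, and boto3 plus the dynamodb-json package are assumed to be available:

import json
import boto3
from dynamodb_json import json_util  # pip install dynamodb-json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dynamodb-restore").getOrCreate()

# 1. Read the Parquet backup and turn each row into a plain dict
df = spark.read.parquet("s3://my-backup-bucket/table-backup/")  # hypothetical path
plain_items = [row.asDict(recursive=True) for row in df.collect()]  # small tables only

# 2. Convert plain JSON to DynamoDB-typed JSON with dynamodb-json and put the
#    items with the low-level client (the boto3 Table resource would accept
#    the plain dicts directly and skip this conversion)
client = boto3.client("dynamodb")
for item in plain_items:
    dynamodb_item = json.loads(json_util.dumps(item))  # {"attr": {"S": "..."}} form
    client.put_item(TableName="my-restored-table", Item=dynamodb_item)  # hypothetical table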
Does this approach sound right? Are there any other alternatives to this approach?
Upvotes: 3
Views: 5290
Reputation: 4710
You can use AWS Glue to convert the Parquet files directly to JSON, then create a Lambda function that triggers on the S3 put and loads the data into DynamoDB.
https://medium.com/searce/convert-csv-json-files-to-apache-parquet-using-aws-glue-a760d177b45f
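A minimal sketch of that Lambda, assuming the Glue job writes newline-delimited JSON and that the table name below is hypothetical:

import json
from decimal import Decimal
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("my-restored-table")  # hypothetical name

def lambda_handler(event, context):
    # Triggered by the S3 put event for each JSON object the Glue job writes
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Spark writes one JSON document per line; batch-write them to DynamoDB
        # (parse_float=Decimal because the DynamoDB resource rejects Python floats)
        with table.batch_writer() as batch:
            for line in body.splitlines():
                if line.strip():
                    batch.put_item(Item=json.loads(line, parse_float=Decimal))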
Upvotes: 2
Reputation: 1962
Your approach will work, but you can also write directly to DynamoDB. You just need to include a few JARs when you run pyspark. Have a look at this:
https://github.com/audienceproject/spark-dynamodb
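For example, a minimal sketch using that connector (the package coordinates/version, bucket path, and table name are assumptions; check the project's README for the release matching your Spark version):

# launch with the connector on the classpath, e.g.
# pyspark --packages com.audienceproject:spark-dynamodb_2.12:1.1.2
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-dynamodb").getOrCreate()

df = spark.read.parquet("s3://my-backup-bucket/table-backup/")  # hypothetical path

# The connector maps DataFrame columns to DynamoDB attributes; the target
# table must already exist with a matching key schema
(df.write
   .format("dynamodb")
   .option("tableName", "my-restored-table")  # hypothetical table
   .save())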
Hope this helps.
Upvotes: 0