user1590561

Reputation: 613

How to write large Pyspark DataFrame to DynamoDB

I have a PySpark DataFrame with 3+ million records and I need to write it to DynamoDB. What is the best way to do it?

Upvotes: 1

Views: 703

Answers (1)

Shubham Jain

Reputation: 5536

If you want to do this using Python, then you can do it as follows:

  • Save the Spark DataFrame as a sufficient number of files, i.e. if the total size is 5 GB, generate 50 files of roughly 100 MB each (a PySpark sketch of this step follows the list).
  • Then write Python code that uses multiprocessing, with a process pool equal to the number of CPUs available.
  • Load the files with DynamoDB's boto3 batch_writer and process all files in parallel (a sketch of such a worker script appears at the end of this answer).
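For the first step, a minimal PySpark sketch might look like the following, assuming the DataFrame is called df and that s3://my-bucket/dynamo-staging/ is a placeholder staging prefix:

    # Repartition so each output file is roughly 100 MB, then write as JSON lines
    # to a staging prefix that the worker processes will later read from.
    # The bucket/prefix is a placeholder, not a real location.
    num_files = 50  # e.g. 5 GB total / ~100 MB per file
    (df.repartition(num_files)
       .write
       .mode("overwrite")
       .json("s3://my-bucket/dynamo-staging/"))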

For this you can use either a Glue Python shell job or create your own container and launch it on Fargate.
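For steps 2 and 3, a minimal sketch of the worker script (the one you would run on a Glue Python shell job or inside the Fargate container) could look like this. The table name and staging directory are placeholders, and the JSON-lines files are assumed to have already been downloaded locally from S3:

    import json
    import multiprocessing as mp
    import os
    from decimal import Decimal

    import boto3

    TABLE_NAME = "my-table"               # placeholder DynamoDB table name
    STAGING_DIR = "/tmp/dynamo-staging"   # placeholder local directory with the files

    def load_file(path):
        """Write every record of one JSON-lines file to DynamoDB.

        The boto3 resource is created inside the worker because boto3 sessions
        are not safe to share across processes. batch_writer() buffers items
        into 25-item BatchWriteItem calls and retries unprocessed items.
        """
        table = boto3.resource("dynamodb").Table(TABLE_NAME)
        with table.batch_writer() as writer, open(path) as f:
            for line in f:
                # DynamoDB rejects Python floats, so parse numbers as Decimal.
                writer.put_item(Item=json.loads(line, parse_float=Decimal))

    if __name__ == "__main__":
        files = [os.path.join(STAGING_DIR, name) for name in os.listdir(STAGING_DIR)]
        # Pool size equal to the number of CPUs available, as suggested above.
        with mp.Pool(processes=mp.cpu_count()) as pool:
            pool.map(load_file, files)

Note that batch_writer only reduces the number of API calls; total write throughput is still limited by the table's provisioned or on-demand capacity.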

Upvotes: 2
