Reputation: 8712
I have an S3 bucket that receives almost 14-15 billion records spread across 26,000 CSV files every day.
I need to parse these files and push the data to MongoDB.
Previously, with just 50 to 100 million records, I was using bulk upserts with multiple parallel processes on an EC2 instance and it worked fine. But since the number of records has increased drastically, the previous method is no longer efficient enough.
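For context, here is a minimal sketch of what the current approach looks like (assuming pymongo and boto3; the bucket name, key field, database/collection names, and batch sizes are placeholders, not my real values):

```python
# Rough sketch of the current approach: one worker process per CSV file,
# each doing unordered bulk upserts into MongoDB.
import csv
import io
from multiprocessing import Pool

import boto3
from pymongo import MongoClient, UpdateOne


def load_one_file(key):
    s3 = boto3.client("s3")
    client = MongoClient("mongodb://localhost:27017")
    coll = client["mydb"]["records"]

    # Read the whole CSV object into memory and parse it.
    text = s3.get_object(Bucket="my-bucket", Key=key)["Body"].read().decode("utf-8")
    reader = csv.DictReader(io.StringIO(text))

    ops, batch_size = [], 1000
    for row in reader:
        # Upsert on a unique id column; adjust the filter to the real schema.
        ops.append(UpdateOne({"_id": row["id"]}, {"$set": row}, upsert=True))
        if len(ops) >= batch_size:
            coll.bulk_write(ops, ordered=False)
            ops = []
    if ops:
        coll.bulk_write(ops, ordered=False)


if __name__ == "__main__":
    keys = ["data/part-0001.csv", "data/part-0002.csv"]  # normally listed from S3
    with Pool(processes=8) as pool:
        pool.map(load_one_file, keys)
```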
What would be the best method to handle this volume?
Upvotes: 0
Views: 1045
Reputation: 1348
You should look at mongoimport, which is written in Go and can make effective use of threads to parallelize the uploading. It's pretty fast. You would have to copy the files from S3 to local disk prior to uploading, but if you put the node in the same region as the S3 bucket and the database, it should run quickly. Also, you could use MongoDB Atlas and its API to turn up the IOPS on your cluster while loading and dial them down afterwards to speed up the upload.
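A rough sketch of how that could be scripted, assuming the files are staged to local disk first (the bucket, URI, database/collection names, staging directory, and worker count are placeholders you would adapt):

```python
# Sketch: pull each CSV from S3 to local disk, then load it with mongoimport.
import os
import subprocess

import boto3

BUCKET = "my-bucket"
MONGO_URI = "mongodb+srv://user:pass@cluster0.example.mongodb.net"

s3 = boto3.client("s3")


def import_key(key):
    local_path = os.path.join("/data/staging", os.path.basename(key))
    s3.download_file(BUCKET, key, local_path)

    subprocess.run(
        [
            "mongoimport",
            "--uri", MONGO_URI,
            "--db", "mydb",
            "--collection", "records",
            "--type", "csv",
            "--headerline",                # first CSV row holds the field names
            "--numInsertionWorkers", "8",  # parallel insert workers inside mongoimport
            "--file", local_path,
        ],
        check=True,
    )
    os.remove(local_path)  # free disk space before the next file


if __name__ == "__main__":
    # In practice you would page through list_objects_v2 to get all 26,000 keys.
    for key in ["data/part-0001.csv", "data/part-0002.csv"]:
        import_key(key)
```

Note this sketch does plain inserts; if you still need upsert semantics, mongoimport also supports `--mode upsert` with `--upsertFields`.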
Upvotes: 2