Sid

Reputation: 115

Read large MongoDB data

I have a java application that needs to read a large amount of data from MongoDB 3.2 and transfer it to Hadoop.

This batch application runs every 4 hours, i.e. 6 times a day.

Data Specifications: the collection holds roughly 80,000 documents, a few gigabytes in total.

Currently I am using MongoTemplate and Morphia to access MongoDB. However, I get an OOM exception when processing this data with the following:

List<MYClass> datalist = datasource.getCollection("mycollection").find().asList();

What is the best way to read this data and load it into Hadoop?

Upvotes: 7

Views: 5314

Answers (2)

Ori Dar

Reputation: 19010

Your problem lies in the asList() call.

This forces the driver to iterate through the entire cursor (80,000 docs, a few gigs), keeping it all in memory.

batchSize(someLimit) and Cursor.batch() won't help here, as you traverse the whole cursor no matter what the batch size is.

Instead you can:

1) Iterate the cursor instead of collecting it into a list: DBCursor cursor = datasource.getCollection("mycollection").find();

2) Read documents one at a time and feed them into a buffer (say, a list)

3) For every 1,000 documents (say), call the Hadoop API, clear the buffer, then start again (see the sketch below).
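To make the buffering step concrete, here is a minimal sketch against the 2.x-style DBCollection/DBCursor API (matching the getCollection(...).find() call in the question); the HadoopWriter interface and the BATCH_SIZE constant are hypothetical placeholders for whatever client and batch size you actually use to push data into Hadoop:

  import com.mongodb.DBCollection;
  import com.mongodb.DBCursor;
  import com.mongodb.DBObject;

  import java.util.ArrayList;
  import java.util.List;

  class MongoToHadoopBatch {

    private static final int BATCH_SIZE = 1000;

    // Placeholder for whatever API pushes a batch of documents to Hadoop
    interface HadoopWriter {
      void write(List<DBObject> batch);
    }

    void transfer(DBCollection collection, HadoopWriter writer) {
      List<DBObject> buffer = new ArrayList<>(BATCH_SIZE);
      DBCursor cursor = collection.find();
      try {
        while (cursor.hasNext()) {
          buffer.add(cursor.next());
          if (buffer.size() == BATCH_SIZE) {
            writer.write(buffer);   // flush a full batch to Hadoop
            buffer.clear();
          }
        }
        if (!buffer.isEmpty()) {
          writer.write(buffer);     // flush the final partial batch
        }
      } finally {
        cursor.close();             // release the server-side cursor
      }
    }
  }

This way at most 1,000 documents sit in memory at any time, regardless of the collection size.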

Upvotes: 4

Bit33

Reputation: 358

The asList() call will try to load the whole MongoDB collection into memory, building an in-memory list object more than 3 GB in size.

Iterating the collection with a cursor will fix this problem. You can do this with the Datastore class, but I prefer the type-safe abstractions that Morphia offers with its DAO classes:

  import java.util.Iterator;

  import org.mongodb.morphia.Datastore;
  import org.mongodb.morphia.dao.BasicDAO;

  // Type-safe DAO for the Order collection
  class Dao extends BasicDAO<Order, String> {
    Dao(Datastore ds) {
      super(Order.class, ds);
    }
  }

  // Inside the batch job:
  Datastore ds = morphia.createDatastore(mongoClient, DB_NAME);
  Dao dao = new Dao(ds);

  // fetch() returns a lazy iterator backed by a MongoDB cursor,
  // so only one document is held in memory at a time
  Iterator<Order> iterator = dao.find().fetch();
  while (iterator.hasNext()) {
    Order order = iterator.next();
    hadoopStrategy.add(order);
  }

Upvotes: 1
