Sid

Reputation: 115

Read large MongoDB data

I have a java application that needs to read a large amount of data from MongoDB 3.2 and transfer it to Hadoop.

This batch application runs every 4 hours, i.e. 6 times a day.

Data Specifications: the collection holds roughly 80,000 documents, a few gigabytes in total.

Currently I am using MongoTemplate and Morphia to access MongoDB. However, I get an OOM exception when processing this data with the following:

List<MYClass> datalist = datasource.getCollection("mycollection").find().asList();

What is the best way to read this data and load it into Hadoop?

Upvotes: 7

Views: 5314

Answers (2)

Ori Dar

Reputation: 19010

Your problem lies in the asList() call.

This forces the driver to iterate through the entire cursor (80,000 docs, a few gigs), keeping it all in memory.

batchSize(someLimit) and Cursor.batch() won't help here, as you traverse the whole cursor no matter what the batch size is.

Instead you can:

1) Iterate the cursor instead of collecting it into a list: DBCursor cursor = datasource.getCollection("mycollection").find();

2) Read documents one at a time and feed them into a buffer (say, a list)

3) For every 1,000 documents (say), call the Hadoop API, clear the buffer, then start again (see the sketch below).
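To make the buffering step concrete, here is a minimal sketch against the 2.x-style DBCollection/DBCursor API (matching the getCollection(...).find() call in the question); the HadoopWriter interface and the BATCH_SIZE constant are hypothetical placeholders for whatever client and batch size you actually use to push data into Hadoop:

  import com.mongodb.DBCollection;
  import com.mongodb.DBCursor;
  import com.mongodb.DBObject;

  import java.util.ArrayList;
  import java.util.List;

  class MongoToHadoopBatch {

    private static final int BATCH_SIZE = 1000;

    // Placeholder for whatever API pushes a batch of documents to Hadoop
    interface HadoopWriter {
      void write(List<DBObject> batch);
    }

    void transfer(DBCollection collection, HadoopWriter writer) {
      List<DBObject> buffer = new ArrayList<>(BATCH_SIZE);
      DBCursor cursor = collection.find();
      try {
        while (cursor.hasNext()) {
          buffer.add(cursor.next());
          if (buffer.size() == BATCH_SIZE) {
            writer.write(buffer);   // flush a full batch to Hadoop
            buffer.clear();
          }
        }
        if (!buffer.isEmpty()) {
          writer.write(buffer);     // flush the final partial batch
        }
      } finally {
        cursor.close();             // release the server-side cursor
      }
    }
  }

This way at most 1,000 documents sit in memory at any time, regardless of the collection size.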

Upvotes: 4

Bit33

Reputation: 358

The asList() call will try to load the whole MongoDB collection into memory, building an in-memory list object more than 3 GB in size.

Iterating the collection with a cursor will fix this problem. You can do this with the Datastore class, but I prefer the type-safe abstractions that Morphia offers with its DAO classes:

  import java.util.Iterator;

  import org.mongodb.morphia.Datastore;
  import org.mongodb.morphia.dao.BasicDAO;

  // Type-safe DAO for the Order collection
  class Dao extends BasicDAO<Order, String> {
    Dao(Datastore ds) {
      super(Order.class, ds);
    }
  }

  // Inside the batch job:
  Datastore ds = morphia.createDatastore(mongoClient, DB_NAME);
  Dao dao = new Dao(ds);

  // fetch() returns a lazy iterator backed by a MongoDB cursor,
  // so only one document is held in memory at a time
  Iterator<Order> iterator = dao.find().fetch();
  while (iterator.hasNext()) {
    Order order = iterator.next();
    hadoopStrategy.add(order);
  }

Upvotes: 1
