Leo Prince

Reputation: 950

MongoDB reading 4 million documents

We have scheduled jobs that run daily. Each job looks for the documents matching that day, applies a minimal transform to each one, and sends it to a queue for downstream processing. Typically we have 4 million documents to process per day, and our aim is to complete the processing within one hour. I am looking for suggestions on best practices for reading 4 million documents from MongoDB quickly.

Upvotes: 1

Views: 780

Answers (1)

Alex Taylor

Reputation: 8813

The MongoDB Async driver is the first stop for low-overhead querying. There's a good example of using the SingleResultCallback on that page:

import com.mongodb.Block;
import com.mongodb.async.SingleResultCallback;
import org.bson.Document;

// Called by the driver for each document as it streams in from the server.
Block<Document> printDocumentBlock = new Block<Document>() {
    @Override
    public void apply(final Document document) {
        System.out.println(document.toJson());
    }
};

// Called once at the end, after the last document (or on error, via t).
SingleResultCallback<Void> callbackWhenFinished = new SingleResultCallback<Void>() {
    @Override
    public void onResult(final Void result, final Throwable t) {
        System.out.println("Operation Finished!");
    }
};

collection.find().forEach(printDocumentBlock, callbackWhenFinished);

It is a common pattern in asynchronous database drivers to allow results to be passed on for processing as soon as they are available. The use of OS-level async I/O helps keep CPU overhead low, which brings up the next problem: how to get the data out.

Without seeing the specifics of your workload, you probably want to place the results into an in-memory queue to be picked up by another thread at this point, so the reader thread can keep reading results. An ArrayBlockingQueue is probably appropriate, and put is more appropriate than add because it will block the reader thread if the workers aren't able to keep up (keeping things balanced). Ideally, you don't want the queue to back up, which is where multiple worker threads become necessary. If the order of the results is important, use a single worker thread; otherwise, use a ThreadPoolExecutor with the queue passed into its constructor (see the sketch below). Using an in-memory queue does open up the possibility of data loss if the results are somehow being discarded as they are read (i.e. if you were immediately sending off another query to delete them) and the reader process crashed.
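As a rough illustration, here is a minimal sketch of that hand-off, using a fixed pool of workers draining the queue (a ThreadPoolExecutor constructed around a BlockingQueue<Runnable> of submitted tasks is the other variant mentioned above). The queue capacity, thread count, and process body are placeholder assumptions, not anything from the question:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.bson.Document;

public class ReaderWorkerHandoff {

    // Bounded queue: put() blocks the reader when the workers fall behind.
    private static final BlockingQueue<Document> QUEUE = new ArrayBlockingQueue<>(10_000);

    public static void main(String[] args) {
        // Four workers; use a single thread instead if result order matters.
        ExecutorService workers = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            workers.submit(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        Document doc = QUEUE.take(); // blocks until work arrives
                        process(doc);                // the 'minimal transform'
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    // Call this from the driver's Block<Document> callback: hand off and return fast.
    static void enqueue(Document document) throws InterruptedException {
        QUEUE.put(document); // blocks when full, throttling the reader
    }

    static void process(Document document) {
        // placeholder for the transform + downstream send
    }
}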

At this point, either do the 'minimal transforms' on the worker thread(s), or serialize the documents in the workers and put them on a real queue (e.g. RabbitMQ, ZeroMQ). Putting them onto a real queue makes it trivial to divide the work amongst multiple machines, provides optional persistence allowing recovery of work, and those queues have great clustering options for scalability. Those machines can then put the results into the queue you mentioned in the question (assuming it has the capacity).
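For illustration, a minimal publishing sketch with the standard RabbitMQ Java client might look like the following; the host, queue name, and JSON serialization are assumptions for the example, not anything prescribed by the question:

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;
import java.nio.charset.StandardCharsets;
import org.bson.Document;

public class WorkQueuePublisher {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumption: broker runs locally

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            // Durable queue, so pending work survives a broker restart.
            channel.queueDeclare("transformed-docs", true, false, false, null);

            Document doc = new Document("example", true); // stand-in for a transformed document
            byte[] body = doc.toJson().getBytes(StandardCharsets.UTF_8);

            // PERSISTENT_TEXT_PLAIN marks the message itself as persistent.
            channel.basicPublish("", "transformed-docs",
                    MessageProperties.PERSISTENT_TEXT_PLAIN, body);
        }
    }
}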

The bottleneck in a system like that is how quickly one machine can get through a single mongo query, and how many results the final queue can handle. All the other parts (MongoDB, queues, number of worker machines) are individually scalable. By doing as little work as possible on the querying machine and pushing that work onto other machines, that impact can be greatly reduced. It sounds like your destination queue is out of your control.

When trying to work out where bottlenecks are, measurements are critical. Adding metrics to your application up front will let you know which areas need improvement when things aren't going well.
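As one way to start, here is a tiny sketch using the Dropwizard Metrics library; the metric names and report interval are illustrative assumptions:

import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;
import java.util.concurrent.TimeUnit;

public class ThroughputMetrics {
    static final MetricRegistry REGISTRY = new MetricRegistry();
    // Meters track both a total count and 1/5/15-minute rates.
    static final Meter DOCS_READ = REGISTRY.meter("docs.read");
    static final Meter DOCS_PUBLISHED = REGISTRY.meter("docs.published");

    public static void main(String[] args) {
        // Dump current counts and rates to stdout every 30 seconds.
        ConsoleReporter reporter = ConsoleReporter.forRegistry(REGISTRY)
                .convertRatesTo(TimeUnit.SECONDS)
                .build();
        reporter.start(30, TimeUnit.SECONDS);

        // Call DOCS_READ.mark() in the reader callback and
        // DOCS_PUBLISHED.mark() after each successful publish.
    }
}

Comparing the read rate against the publish rate tells you immediately whether the reader or the workers are the slow side.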

That set-up makes for a pretty scalable system; I've built many similar systems before. Beyond that, you'll want to investigate getting your data into something like Apache Storm.

Upvotes: 1
