Scaling and selecting unique records per worker

Question

I'm working on a project that has to deal with a large stream of data. The data will be written to MongoDB, and from there workers need to process each record to decide, among other things, whether it should be persisted or discarded.

We want to scale this horizontally. My question is: what methods are there for ensuring that each worker selects a record that isn't already being processed by another worker?

Is a central main worker required to hand out jobs to the sub-workers? If so, that central worker becomes the bottleneck and the single point of failure, right?

Any ideas or suggestions welcome.

Thanks!

Josh

robertklep · Accepted Answer

You can use findAndModify to both select and flag a document atomically, making sure that only one worker gets to process it. My experience is that this can be slow due to excessive database locking, but that experience is based on MongoDB 2.x so it may not be an issue anymore on 3.x.
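As a rough illustration of that claim step, here is a minimal Python sketch. It assumes the collection object comes from pymongo, where the shell's findAndModify is exposed as find_one_and_update. The field names status, worker and claimed_at, and the helper name claim_next, are made up for this example and are not something the question specifies.

```python
from datetime import datetime, timezone


def claim_next(collection, worker_id):
    """Atomically claim one pending document, or return None if none is left.

    `collection` is assumed to be a pymongo Collection. The "status",
    "worker" and "claimed_at" fields are hypothetical names for this sketch.
    """
    return collection.find_one_and_update(
        # Only match documents that no worker has flagged yet.
        {"status": "pending"},
        # Flag the document in the same atomic operation that selects it.
        {
            "$set": {
                "status": "processing",
                "worker": worker_id,
                "claimed_at": datetime.now(timezone.utc),
            }
        },
        # Oldest _id first, so documents are picked up roughly in insertion order.
        sort=[("_id", 1)],
    )
```

Each worker would call claim_next(db.jobs, "worker-1") in a loop (db.jobs being whatever collection holds the records) and back off for a moment when it gets None. By default pymongo returns the document as it looked before the update, which is enough to get at its _id and payload. Storing claimed_at also leaves room to re-queue documents whose worker died mid-job, though that part is not shown here.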

Also, with MongoDB it's difficult to "wait" for new jobs/documents (you can tail the oplog, but you'd have to do this from every worker and each one will wake up and perform the findAndModify() query, resulting in the aforementioned locking).

I think that ultimately you should consider using a proper messaging solution: write the data to MongoDB, write the _id to the broker, and have the workers subscribe to the message queue. If you configure things properly, only one worker will get each job. Well-known brokers are RabbitMQ and NSQ (nsq.io), and with a bit of extra work you can even use Redis.
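To make the "only one worker will get a job" part concrete, here is a small in-process simulation of the idea. It is not broker code: Python's queue.Queue stands in for the broker queue, threads stand in for the worker processes, and run_workers is a hypothetical helper written for this sketch. With a real broker the publish and consume calls would go through that broker's client library instead.

```python
import queue
import threading


def run_workers(job_ids, num_workers=3):
    """Hand each job id to exactly one worker; return {worker_name: [ids]}."""
    jobs = queue.Queue()  # stand-in for the broker queue (an assumption)
    handled = {f"worker-{i}": [] for i in range(num_workers)}

    def worker(name):
        while True:
            job_id = jobs.get()  # blocks until a job is available
            if job_id is None:  # sentinel: no more work for this worker
                jobs.task_done()
                return
            # A real worker would load the document by _id from MongoDB here
            # and decide whether to persist or discard it.
            handled[name].append(job_id)
            jobs.task_done()

    threads = [threading.Thread(target=worker, args=(name,)) for name in handled]
    for t in threads:
        t.start()
    for job_id in job_ids:  # the producer publishes one message per document
        jobs.put(job_id)
    for _ in threads:  # one sentinel per worker so every thread exits
        jobs.put(None)
    jobs.join()
    for t in threads:
        t.join()
    return handled
```

Calling run_workers(range(100), num_workers=4) spreads the 100 ids over four workers, and no id shows up in more than one worker's list. The queue itself does the hand-out, so there is no central worker process to write or to become a bottleneck; the broker takes on that role and is built for it.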
