Reputation: 19468

Process work in parallel with non-threadsafe function in scala

I have a lot of work (thousands of jobs) for a Scala application to process. Each piece of work is the file name of a 100 MB file. To process each file, I need to use an extractor object that is not thread safe (I can have multiple copies, but copies are expensive, and I should not make one per job). What is the best way to complete this work in parallel in Scala?

Upvotes: 1

Answers (5)

Jonny Coombes

Reputation: 575

First question: how quick does the work need to be completed?

Second question: would this work be isolated to a single physical box or what are your upper bounds on computational resource.

Third question: does the work that needs doing to each individual "job" require blocking and is it serialised or could be partitioned into parallel packets of work?

Maybe think about a distributed model whereby you scale through designing with a mind to pushing out across multiple nodes from the first instance, actors, remoteref all that crap first...try and keep your logic simple and easy - so serialised. Don't just think in terms of a single box.

Most answers here seem to dwell on the intricacies of spawning thread pools and executors and all that stuff - which is fine, but be sure you have a handle on the real problem first, before you start complicating your life with lots of thinking around how you manage the synchronisation logic.

If a problem can be decomposed, then decompose it. Don't overcomplicate it for the sake of doing so - it leads to better engineered code and less sleepless nights.

Upvotes: 0

paradigmatic

Reputation: 40461

It depends: what's the relative amount of CPU consumed by the extractor for each job ?

If it is very small, you have a classic single-producer/multiple-consumer problem for which you can find lots of solution in different languages. For Scala, if you are reluctant to start using actors, you can still use the Java API (Runnable, Executors and BlockingQueue, are quite good).
If it is a substantial amount (more than 10%), you app will never scale with a multithread model (see Amdhal law). You may prefer to run several process (several JVM) to obtain thread safety, and thus eliminate the non-sequential part.

Upvotes: 0

ziggystar

Reputation: 28680

Don't make 1000 jobs, but make 4x250 jobs (targeting 4 threads) and give one extractor to each batch. Inside each batch, work sequentially. This might not be optimal parallel-wise, since one batch might finish earlier but it is very easy to implement.

Probably the correct (but more complicated) solution would be to make a pool of extractors, where jobs take extractors from and put them back after finishing.

Upvotes: 1

Dan Simon

Reputation: 13137

You can wrap your extractor in an Actor and send each file name to the actor as a message. Since an instance of an actor will process only one message at a time, thread safety won't be an issue. If you want to use multiple extractors, just start multiple instances of the actor and balance between them (you could write another actor to act as a load balancer).

The extractor actor(s) can then send extracted files to other actors to do the rest of the processing in parallel.

Upvotes: 2

Phil Miller

Reputation: 38118

I would make a thread pool, where each thread has an instance of the extractor class, and instantiate just as many of these threads as it takes to saturate the system (based on CPU usage, IO bandwidth, memory bandwidth, network bandwidth, contention for other shared resources, etc.). Then use a thread-safe work queue that these threads can pull tasks from, process them, and iterate until the container is empty.

Mind you, there should be one or several libraries in just about any modern language that implements exactly this. In C++, it would be Intel's Threading Building Blocks. In Objective-C, it would be Grand Central Dispatch.

Upvotes: 0

Process work in parallel with non-threadsafe function in scala

Answers (5)

Related Questions