Reputation: 19468
I have a lot of work (thousands of jobs) for a Scala application to process. Each piece of work is the file name of a 100 MB file. To process each file, I need to use an extractor object that is not thread-safe (I can have multiple copies, but copies are expensive, so I should not make one per job). What is the best way to complete this work in parallel in Scala?
Upvotes: 1
Views: 436
Reputation: 575
First question: how quickly does the work need to be completed?
Second question: will this work be confined to a single physical box, or what are your upper bounds on computational resources?
Third question: does the work on each individual "job" require blocking, and is it inherently serial or could it be partitioned into parallel packets of work?
Maybe think about a distributed model, designing from the outset to scale out across multiple nodes: consider actors, remoteref and all that crap first. Try to keep your logic simple and easy, so serialised. Don't just think in terms of a single box.
Most answers here seem to dwell on the intricacies of spawning thread pools and executors and all that stuff - which is fine, but be sure you have a handle on the real problem first, before you start complicating your life with lots of thinking around how you manage the synchronisation logic.
If a problem can be decomposed, then decompose it. Don't overcomplicate it for the sake of doing so; keeping it simple leads to better-engineered code and fewer sleepless nights.
Upvotes: 0
Reputation: 40461
It depends: what's the relative amount of CPU consumed by the extractor for each job?
If it is very small, you have a classic single-producer/multiple-consumer problem, for which you can find lots of solutions in different languages. For Scala, if you are reluctant to start using actors, you can still use the Java API (Runnable, Executors and BlockingQueue are quite good).
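A minimal sketch of that producer/consumer setup using the Java API from Scala, assuming Scala 2.13 or later. `Extractor` here is a hypothetical stand-in for the real extractor, and `QueueDemo`, `processAll` and the `Done` sentinel are names invented for illustration only:

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, LinkedBlockingQueue, TimeUnit}
import scala.jdk.CollectionConverters._

// Hypothetical stand-in for the real, non-thread-safe extractor.
final class Extractor {
  def extract(fileName: String): String = s"extracted:$fileName"
}

object QueueDemo {
  // Sentinel ("poison pill") telling a consumer that no more work is coming.
  private val Done = "\u0000DONE"

  def processAll(fileNames: Seq[String], consumers: Int): Seq[String] = {
    val queue   = new LinkedBlockingQueue[String]()
    val results = new ConcurrentLinkedQueue[String]()
    val pool    = Executors.newFixedThreadPool(consumers)

    for (_ <- 1 to consumers) pool.execute(new Runnable {
      def run(): Unit = {
        val extractor = new Extractor // one copy per consumer, never shared
        var name = queue.take()       // blocks until work is available
        while (name != Done) {
          results.add(extractor.extract(name))
          name = queue.take()
        }
      }
    })

    fileNames.foreach(n => queue.put(n))           // the single producer
    (1 to consumers).foreach(_ => queue.put(Done)) // one pill per consumer
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.HOURS)

    results.asScala.toSeq // completion order, not input order
  }
}
```

Each consumer builds exactly one extractor, so the number of expensive copies equals the number of consumer threads rather than the number of jobs.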
If it is a substantial amount (more than 10%), your app will never scale with a multithreaded model (see Amdahl's law). You may prefer to run several processes (several JVMs), each with its own extractor, to obtain thread safety and thus eliminate the sequential part.
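To put a number on that limit, here is a small sketch of Amdahl's law. It assumes the extractor time is fully serial (for example, a single shared extractor behind a lock); `Amdahl` and `speedup` are illustrative names:

```scala
object Amdahl {
  // Speedup predicted by Amdahl's law for a workload whose serial
  // (non-parallelizable) fraction is `serial`, run on `n` threads.
  def speedup(serial: Double, n: Int): Double =
    1.0 / (serial + (1.0 - serial) / n)
}
```

With a 10% serial fraction, 4 threads give roughly a 3.08x speedup, and no number of threads can exceed 10x, since the formula tends to 1 / 0.1 as `n` grows.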
Upvotes: 0
Reputation: 28680
Don't make 1000 separate jobs; make 4 batches of 250 jobs (targeting 4 threads) and give one extractor to each batch. Inside each batch, work sequentially. This might not be optimal parallel-wise, since one batch might finish early and leave its thread idle, but it is very easy to implement.
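A rough sketch of the batching approach, with one extractor per batch and sequential work inside each batch. `Extractor` is a hypothetical stand-in for the real class, and `BatchDemo` and `processInBatches` are made-up names:

```scala
import java.util.concurrent.{Callable, Executors}

// Hypothetical stand-in for the real, non-thread-safe extractor.
final class Extractor {
  def extract(fileName: String): String = s"extracted:$fileName"
}

object BatchDemo {
  def processInBatches(fileNames: Seq[String], threads: Int): Seq[String] = {
    // Split the work into at most `threads` batches of roughly equal size.
    val batchSize = math.max(1, math.ceil(fileNames.size.toDouble / threads).toInt)
    val pool = Executors.newFixedThreadPool(threads)
    try {
      val futures = fileNames.grouped(batchSize).toList.map { batch =>
        pool.submit(new Callable[Seq[String]] {
          def call(): Seq[String] = {
            val extractor = new Extractor        // one extractor per batch
            batch.map(n => extractor.extract(n)) // sequential within the batch
          }
        })
      }
      futures.flatMap(_.get()) // wait for every batch; input order is kept
    } finally pool.shutdown()
  }
}
```

Calling `BatchDemo.processInBatches(names, 4)` on 1000 names would create 4 batches of 250 and only 4 extractors.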
Probably the correct (but more complicated) solution would be to make a pool of extractors, from which each job takes an extractor and to which it returns the extractor after finishing.
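One way the extractor pool might look, using a blocking queue as the pool. This is only a sketch: `Extractor` is a hypothetical stand-in, and `ExtractorPoolDemo`, `processWithPool` and its parameters are invented for illustration:

```scala
import java.util.concurrent.{ArrayBlockingQueue, Callable, Executors}

// Hypothetical stand-in for the real, non-thread-safe extractor.
final class Extractor {
  def extract(fileName: String): String = s"extracted:$fileName"
}

object ExtractorPoolDemo {
  def processWithPool(fileNames: Seq[String], extractorCount: Int, threads: Int): Seq[String] = {
    // The pool: a fixed set of extractors, created once up front.
    val extractors = new ArrayBlockingQueue[Extractor](extractorCount)
    for (_ <- 1 to extractorCount) extractors.put(new Extractor)

    val workers = Executors.newFixedThreadPool(threads)
    try {
      val futures = fileNames.toList.map { name =>
        workers.submit(new Callable[String] {
          def call(): String = {
            val extractor = extractors.take() // borrow; blocks if none is free
            try extractor.extract(name)
            finally extractors.put(extractor) // always hand it back
          }
        })
      }
      futures.map(_.get()) // results in input order
    } finally workers.shutdown()
  }
}
```

Because an extractor is held by only one job at a time, it is never used concurrently, and jobs are balanced individually rather than in fixed batches.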
Upvotes: 1
Reputation: 13137
You can wrap your extractor in an Actor and send each file name to the actor as a message. Since an instance of an actor will process only one message at a time, thread safety won't be an issue. If you want to use multiple extractors, just start multiple instances of the actor and balance between them (you could write another actor to act as a load balancer).
The extractor actor(s) can then send extracted files to other actors to do the rest of the processing in parallel.
Upvotes: 2
Reputation: 38118
I would make a thread pool, where each thread has an instance of the extractor class, and instantiate just as many of these threads as it takes to saturate the system (based on CPU usage, IO bandwidth, memory bandwidth, network bandwidth, contention for other shared resources, etc.). Then use a thread-safe work queue: each thread pulls a task from it, processes it, and repeats until the queue is empty.
Mind you, there should be one or several libraries in just about any modern language that implement exactly this. In C++, it would be Intel's Threading Building Blocks. In Objective-C, it would be Grand Central Dispatch.
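On the JVM, a fixed thread pool from `java.util.concurrent` already provides the threads and the thread-safe work queue. A sketch of the per-thread extractor idea on top of it, using a `ThreadLocal` (a choice made here for illustration, not something the answer prescribes); `Extractor`, `PerThreadDemo` and `processAll` are hypothetical names:

```scala
import java.util.concurrent.{Callable, Executors}

// Hypothetical stand-in for the real, non-thread-safe extractor.
final class Extractor {
  def extract(fileName: String): String = s"extracted:$fileName"
}

object PerThreadDemo {
  def processAll(fileNames: Seq[String], threads: Int): Seq[String] = {
    // Each pool thread lazily creates its own Extractor on first use.
    val perThread: ThreadLocal[Extractor] = new ThreadLocal[Extractor] {
      override def initialValue(): Extractor = new Extractor
    }
    // The executor's internal queue plays the role of the work queue.
    val pool = Executors.newFixedThreadPool(threads)
    try {
      val futures = fileNames.toList.map { name =>
        pool.submit(new Callable[String] {
          def call(): String = perThread.get().extract(name)
        })
      }
      futures.map(_.get()) // results in input order
    } finally pool.shutdown()
  }
}
```

The `threads` argument is the knob to tune until the system is saturated; at most that many extractors are ever created.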
Upvotes: 0