Eduardo
Eduardo

Reputation: 8392

How to split an array in even chunks for parallel processing

I have a large array of invoice data that is sorted by invoice date. I would like to summarize the data by some function of the date, for example: by month, quarter, etc. Since the array is very large, I would like to take advantage of multithreading, and process several portions of the array in parallel, in different threads.

Is am looking for an algorithm to split the array into N chunks, as similar in size as possible, with the condition that there will be no two elements of the array for which the date-transforming function yields the same result that end up in two different chunks.

So, for example, if I choose to partition it by month, all November 2014 invoices will be processed by the same thread.

I was just wondering if this was a known problem with a name, for which an algorithm already exists, which would save me from reinventing the wheel.

Upvotes: 1

Views: 2616

Answers (1)

Eric
Eric

Reputation: 19863

For your case (and ANY), multiple threads might or might not speed up your application. It depends on a gazillion of factors.

That being said, a common pattern for this is to have a bunch of worker threads in a pool.

Pseudo-code should go like this :

while your array is not empty 
  if there is at least one idle worker
    pick one idle thread and process element 0 from the array
    Remove element 0 from the array
    Process the element
  else wait for a worker thread to be idle

I would not bother splitting by month or such, because it leads to other problems where you have months without data or months with tons of data which skews the parallelism.

If you really want the sorting, then your worker thread can insert the processed data at the right place. Be careful however about synchronization. If you have a single thread running the above while, the you are correct, but if multiple threads try to insert into the "result array" for example, then you need to lock it before inserting

Edit based on your comments :

If you really want to avoid synchronizing and think that having all threads writing to the same array will work (some languages might not allow this, really unsure), then have all your data in one array, split the array in N equal chunks of M length. Create a result array with NxM buckets then when you have an idle worker start, assign it the Nth chunk and the Nth bucket.

At the end of the execution your data will be sorted and contiguous

Upvotes: 2

Related Questions