user3139990
user3139990

Reputation: 71

Speed-up from multi-threading

I have a highly parallelizable problem. Hundreds of separate problems need to be solved by the same function. The problems each take an average of perhaps 120 ms (0.12 s) on a single core, but there is substantial variation, and some extreme and rare ones may take 10 times as long. Each problem needs memory, but this is allocated ahead of time. The problems do not need disk I/O, and they do not pass back and forth any variables once they are running. They do access different parts (array elements) of the same global struct, though.

I have C++ code, based on someone else's code, that works. (The global array of structs is not shown.) It runs 20 problems (for instance) and then returns. I think 20 is enough to even out the variability on 4 cores. I see the execution time flattening out from about 10 already.

There is a Win32 and an OpenMP version, and they behave almost identically in terms of execution time. I run the program on a 4-core Windows system. I include some OpenMP code below since it is shorter. (I changed names etc. to make it more generic and I may have made mistakes -- it won't compile stand-alone.)

The speed-up over the single-threaded version flattens out at about a factor of 2.3. So if it takes 230 seconds single-threaded, it takes 100 s multi-threaded. I am surprised that the speed-up is not a lot closer to 4, the number of cores.

Am I right to be disappointed?

Is there anything I can do to get closer to my theoretical expectation?

int split_bigtask(Inputs  * inputs, Outputs * results)
{
  for (int k = 0; k < MAXNO; k++)
    results->solved[k].value = 0;

  int res;
  #pragma omp parallel shared(inputs, outputs)
  {
    #pragma omp for schedule(dynamic)
    for (int k = 0; k < inputs->no; k++)
    {
      res = bigtask(inputs->values[k], 
                    outputs->solved[k], 
                    omp_get_thread_num()
                   );
    }
  }
  return TRUE;
}

Upvotes: 1

Views: 175

Answers (1)

Daniel
Daniel

Reputation: 1051

  1. I Assume that there is no synchronization done within bigtask() (Obvious, but I'd still check it first).
  2. It's possible that you run into a "dirty cache" problem: If you manipulate data that is close to each other (e.g. same cache line!) from multiple cores each manipulation will mark the cache line as dirty (which means that the processor needs to signal this to all other processeors which in turn involves synchronization again...).
  3. you create too many threads (allocating a thread is quite an overhead. So creating one thread for each core is a lot more efficient than creating 5 threads each).

I personally would assume that you have case 2 ("Big Global Array").

Solution to the problem (if it's indeed case 2):

  • Write the results to a local array which is merged into the "Big Global Array" by the main thread after the end of the work
  • Split the global array into several smaller arrays (and give each thread one of these arrays)
  • Ensure that the records within the structure align on Cache-Line boundaries (this is a bit a hack since cache line boundaries may change for future processors) You may want to try to create a local copy of the array for each thread (at least for the results)

Upvotes: 2

Related Questions