gfcf14

Reputation: 350

OpenMP parallelization inside for loops takes too long

I am writing a program that must use OpenMP parallelization. The program compares two frames block by block, and OpenMP must be applied in two ways: one where the work is split across threads at the frame level, and another where it is split across threads at the block level, finding the minimum cost of each comparison.

The main idea behind the skeleton of the code would look as follows:

int main() {
  // code
  for () {
    for () {
      searchBlocks();
    }
  }
  // code
}

searchBlocks() {
  for () {
    for () {
      getCost();
    }
  }
}

getCost() {
  for () {
    for () {
      // operations
    }
  }
}

Then, for parallelization at the frame level, I can simply change the nested loop in main to this:

int main() {
  // code
  omp_set_num_threads(threadNo);

  #pragma omp parallel for collapse(2) if (isFrame)
  for () {
    for () {
      searchBlocks();
    }
  }
  // code
}

Where threadNo is specified at run time and isFrame is a parameter that specifies whether frame-level parallelization is wanted. This works, and the execution time shrinks as the number of threads grows. However, for block-level parallelization I attempted the following:

getCost() {
  #pragma omp parallel for collapse(2) if (isFrame)
  for () {
    for () {
      // operations
    }
  }
}

I'm doing this in getCost() since it is the innermost function, where the comparison of each pair of corresponding blocks happens. But with this change the program takes far longer to execute: a run without OpenMP support (a single thread) finishes before a run with OpenMP support and 10 threads.

Is there something I'm not declaring right here? I'm setting the number of threads right before the nested loops in main, just as in the frame-level parallelization.

Please let me know if I need to explain this better, or what I could change to get this parallelization to run successfully. Thanks to anyone who can help.

Upvotes: 2

Views: 478

Answers (2)

Alexey S. Larionov

Reputation: 7927

Remember that every time your program executes a #pragma omp parallel directive, it spawns new threads. Spawning threads is very costly, and since you call getCost() many, many times and each call is not that computationally heavy, you end up spending all the time spawning and joining threads (which is essentially making costly system calls).

On the other hand, when a #pragma omp for directive is executed, it doesn't spawn any threads; it lets the existing threads (spawned by the enclosing parallel directive) execute separate pieces of the data in parallel.

So what you want is to spawn the threads at the top level of your computation: (notice there is no for)

int main() {
  // code
  omp_set_num_threads(threadNo);

  #pragma omp parallel
  for () {
    for () {
      searchBlocks();
    }
  }
  // code
}

and then later split the loops with: (notice there is no parallel, and no if clause, since OpenMP does not allow if on a bare for worksharing directive)

getCost() {
  #pragma omp for collapse(2)
  for () {
    for () {
      // operations
    }
  }
}

Upvotes: 3

Anton Anisimov

Reputation: 66

You get cascading (nested) parallelization. Take the loop bounds in main as I and J, and in the getCost loops as K and L: you are asking for threads at every level, on the order of I * J * K * L. Any operating system will go crazy under that; you're not far from a fork bomb.

Also, it's not clear why "collapse" is there: it still covers only the two loops inside, and I don't see much point in that clause. Try removing the parallelism in getCost.

Upvotes: 0
