Harsh Patel

Reputation: 1344

How can I use a parallel reduction approach to combine partial sums in C?

I have to combine partial sums using a parallel reduction approach in C, but I don't have any idea how to do it, so I need the community's guidance.

What I need to achieve: for example, with 8 computation threads, in the first reduction step each thread adds two elements of the array; thread 4 should wait until thread 0 is done, thread 5 has to wait until thread 1 is done, thread 6 waits for thread 2, and thread 7 should wait until thread 3 is done.

In the second step, thread 6 waits until thread 4 has finished, and thread 7 waits until thread 5 has finished.

In the third step, thread 7 waits until thread 6 is done. Then I need to print the whole array.

Please help me; give me some guidance on how to achieve this.

Upvotes: 0

Views: 204

Answers (2)

Qiang

Reputation: 11

I think OpenMP is more suitable for your question. It supports C and doesn't require many changes to your code. The following solution is adapted from the Intel documentation and works well on Ubuntu 22.04 with gcc 11.

// parallel_reduction.c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <omp.h>

int main (int argc, char **argv)
{
    uint32_t i = 0, n = 0xfffffff;
    float *array, total = 0.0f;

    /* calloc zero-initializes the array, so the sum is well defined */
    array = (float *)calloc(n, sizeof(float));

    /* each thread accumulates a private partial sum; OpenMP combines
       them into `total` when the loop ends */
#pragma omp parallel for reduction(+:total)
    for (i = 0; i < n; ++i) {
        total += array[i];
    }

    printf("total = %f\n", total);

    free(array);

    return 0;
}

Compile the code with gcc (or g++ if you're using C++) and benchmark the performance:

gcc -fopenmp parallel_reduction.c -o pr
time ./pr

real    0m0.078s
user    0m2.806s
sys     0m5.854s

OpenMP utilizes all of your CPU cores by default; this behavior can be changed with an environment variable:

time OMP_NUM_THREADS=4 ./pr
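
The thread count can also be chosen from inside the program with the OpenMP runtime API instead of the environment variable. A small sketch (the count of 4 and the tiny loop are just examples):

// num_threads.c -- choosing the thread count in code instead of via OMP_NUM_THREADS
#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_num_threads(4);                  // same effect as OMP_NUM_THREADS=4
    printf("max threads: %d\n", omp_get_max_threads());

    int total = 0;
#pragma omp parallel for reduction(+:total)
    for (int i = 0; i < 1000; ++i) {
        total += i;
    }
    printf("total = %d\n", total);           // prints 499500
    return 0;
}

Build it the same way: gcc -fopenmp num_threads.c -o num_threads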

The parallel_reduction.c example gives more than a 10x performance improvement on my side, for your reference.

The Intel oneTBB library also includes a "parallel_reduce" algorithm, but it only supports C++ and needs more modification to your code. The advantage is that it's a library and doesn't need special support from the compiler.

[Reference]

oneTBB: parallel_reduce

OpenMP: OpenMP Reduction Operations

Upvotes: 1

Employed Russian

Reputation: 213877

"I'm confused about how to compute which thread a given thread should wait for: 2^r-i, where r = log(m)."

There are at least two easy ways to do that:

  • save all thread-ids in a global pthread_t ptid[N]; array.
  • save all thread-ids in a local pthread_t ptid[N]; array, and pass the address of that array into each thread via the thread argument.

Example pseudo-code for the latter (error handling omitted for clarity):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define N 8   // number of threads, as in the question

struct Arg {
  pthread_t *ptid;  // Address of ptid[N].
  int idx;          // Index of this thread.
};

void *partial_sum(void *p)
{
  struct Arg *arg = (struct Arg *)p;

  int sum = 0;
  ... // Compute my portion of the partial sum.

  int other_thread_idx = ...;  // Compute index of the other thread
                               // this thread should join, -1 if none.
  if (other_thread_idx >= 0) {
    void *ret;

    // Get the result from the other thread via its exit value.
    pthread_join(arg->ptid[other_thread_idx], &ret);
    int other_sum = (int)(intptr_t)ret;
    printf("Thread %d joined thread %d which returned sum %d\n",
           arg->idx, other_thread_idx, other_sum);
    sum += other_sum;
  }
  printf("Thread %d, sum: %d\n", arg->idx, sum);
  return (void *)(intptr_t)sum;  // Pass the partial sum back as the exit value.
}

int main()
{
  struct Arg args[N];
  pthread_t ptid[N];

  for (int i = 0; i < N; ++i) {
    struct Arg *arg = &args[i];
    arg->idx = i;
    arg->ptid = &ptid[0];
    pthread_create(&ptid[i], NULL, partial_sum, arg);
  }

  // Get the final result.
  void *ret;

  // Note: joining only the last thread -- all others have been joined
  // already.
  pthread_join(ptid[N - 1], &ret);

  printf("Sum: %d\n", (int)(intptr_t)ret);
  return 0;
}
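
For concreteness, here is a minimal, self-contained sketch of how that join tree could look for the exact 8-thread pattern described in the question. The 8-element example array and the dist loop are my assumptions about the step logic; adapt them to your real data:

/* tree_sum.c -- 8 threads, thread i starts with array[i]; in each step a
 * thread with index >= N - dist joins the thread 'dist' positions below it
 * and adds that thread's result (dist = 4, 2, 1). */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define N 8

static int array[N] = {1, 2, 3, 4, 5, 6, 7, 8};  /* example input */
static pthread_t ptid[N];

static void *tree_sum(void *p)
{
  int idx = (int)(intptr_t)p;
  int sum = array[idx];           /* this thread's own element */

  /* distance starts at N/2 and halves each step: 4, 2, 1 */
  for (int dist = N / 2; dist >= 1; dist /= 2) {
    if (idx >= N - dist) {
      void *ret;
      /* wait for the partner 'dist' positions below and take its result */
      pthread_join(ptid[idx - dist], &ret);
      printf("dist=%d: thread %d joined thread %d (partial sum %d)\n",
             dist, idx, idx - dist, (int)(intptr_t)ret);
      sum += (int)(intptr_t)ret;
    }
  }
  return (void *)(intptr_t)sum;   /* pass the accumulated sum back */
}

int main(void)
{
  for (int i = 0; i < N; ++i)
    pthread_create(&ptid[i], NULL, tree_sum, (void *)(intptr_t)i);

  void *ret;
  pthread_join(ptid[N - 1], &ret);  /* thread N-1 ends up with the full sum */
  printf("Total: %d\n", (int)(intptr_t)ret);
  return 0;
}

With this layout, thread 4 joins 0, thread 5 joins 1, thread 6 joins 2 and then 4, and thread 7 joins 3, 5, and finally 6, matching the three steps in the question; main only needs to join the last thread.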

Upvotes: 0
