Krzysztof Kosiński

Reputation: 4325

Updating a maximum value from multiple threads

Is there a way to update a maximum from multiple threads using atomic operations?

Illustrative example:

std::vector<float> coord_max(128);
#pragma omp parallel for
for (int i = 0; i < limit; ++i) {
    int j = get_coord(i); // can return any value in range [0,128)
    float x = compute_value(j, i);
    #pragma omp critical (coord_max_update)
    coord_max[j] = std::max(coord_max[j], x);
}

In the above case, the critical section synchronizes access to the entire vector, whereas we only need to synchronize access to each of the values independently.

Upvotes: 5

Views: 2699

Answers (4)

Krzysztof Kosiński

Reputation: 4325

Following a suggestion in a comment, I found a solution that does not require locking and instead uses the compare-and-exchange functionality found in std::atomic / boost::atomic. I am limited to C++03, so I would use boost::atomic in this case.

#include <boost/atomic.hpp>
#include <boost/static_assert.hpp>

BOOST_STATIC_ASSERT(sizeof(int) == sizeof(float));
union FloatPun { float f; int i; };

std::vector< boost::atomic<int> > coord_max(128);
#pragma omp parallel for
for (int i = 0; i < limit; ++i) {
    int j = get_coord(i);
    FloatPun x, maxval;
    x.f = compute_value(j, i);

    maxval.i = coord_max[j].load(boost::memory_order_relaxed);
    do {
        // Stop if the stored maximum is already at least as large.
        if (maxval.f >= x.f) break;
        // On failure, compare_exchange_weak reloads maxval.i,
        // so the comparison above is retried against the fresh value.
    } while (!coord_max[j].compare_exchange_weak(maxval.i, x.i,
        boost::memory_order_relaxed));
}

There is some boilerplate involved in packing the float values into ints, since atomic floats do not appear to be lock-free. I am not 100% sure about the memory ordering, but the least restrictive level, 'relaxed', seems to be OK, since no non-atomic memory is involved.
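For readers not limited to C++03, the union boilerplate disappears entirely: below is a minimal sketch of the same CAS loop using C++11's std::atomic<float> (update_max is an illustrative helper, not part of the original code). Note that compare_exchange compares the bits rather than the values, which is fine here because we only ever replace with a strictly greater value.

#include <atomic>
#include <vector>

// One atomic maximum per coordinate; constructing the vector with a size
// value-initializes the atomics in place, even though they are not copyable.
std::vector< std::atomic<float> > coord_max(128);

void update_max(std::atomic<float>& maximum, float x) {
    float current = maximum.load(std::memory_order_relaxed);
    while (x > current &&
           !maximum.compare_exchange_weak(current, x,
                                          std::memory_order_relaxed)) {
        // On failure, 'current' is reloaded, and the x > current
        // test is evaluated again against the fresh value.
    }
}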

Upvotes: 6

Massimiliano

Reputation: 8032

Just to add my two cents: before starting more fine-grained optimizations, I would try the following approach, which removes the need for omp critical:

std::vector<float> coord_max(128);
float              fbuffer(0);
#pragma omp parallel
{
  std::vector<float> thread_local_buffer(128);

  // Assume limit is a really big number
  #pragma omp for
  for (int ii = 0; ii < limit; ++ii) {
    int jj = get_coord(ii); // can return any value in range [0,128)
    float x = compute_value(jj, ii);
    thread_local_buffer[jj] = std::max(thread_local_buffer[jj], x);
  }
  // At this point each thread has a partial local vector
  // containing the maximum of the part of the problem
  // it has explored

  // Reduce the results
  for (int ii = 0; ii < 128; ii++) {
    // Find the max for position ii; with schedule(static,1) each
    // thread executes exactly one iteration and contributes its
    // own local value to the reduction
    #pragma omp for schedule(static,1) reduction(max:fbuffer)
    for (int jj = 0; jj < omp_get_num_threads(); jj++) {
      fbuffer = thread_local_buffer[ii];
    } // Barrier implied here
    // Write it in the vector at correct position
    #pragma omp single
    {
      coord_max[ii] = fbuffer;
      fbuffer = 0;
    } // Barrier implied here
  }
}

Note that I didn't compile the snippet, so it may contain syntax errors. Anyhow, I hope I have conveyed the idea.
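If the two-level reduction above feels heavy, a simpler (if slightly more contended) variant is the following sketch, which merges each thread's local buffer into the shared vector under a single critical section, entered once per thread rather than once per update:

std::vector<float> coord_max(128);
#pragma omp parallel
{
  std::vector<float> thread_local_buffer(128);

  #pragma omp for
  for (int ii = 0; ii < limit; ++ii) {
    int jj = get_coord(ii);
    thread_local_buffer[jj] =
        std::max(thread_local_buffer[jj], compute_value(jj, ii));
  }

  // One critical section per thread, not per element
  #pragma omp critical (merge_buffers)
  for (int jj = 0; jj < 128; ++jj)
    coord_max[jj] = std::max(coord_max[jj], thread_local_buffer[jj]);
}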

Upvotes: 1

Reunanen

Reputation: 8001

How about declaring, say, a std::vector<std::mutex> (or boost::mutex) of length 128, and then creating a lock object from the j-th element?

I mean, something like:

std::vector<float> coord_max(128);
std::vector<std::mutex> coord_mutex(128);
#pragma omp parallel for
for (int i = 0; i < limit; ++i) {
    int j = get_coord(i); // can return any value in range [0,128)
    float x = compute_value(j, i);
    std::scoped_lock lock(coord_mutex[j]); // C++17; released at end of iteration
    coord_max[j] = std::max(coord_max[j], x);
}

Or, as per Rahul Banerjee's suggestion #3:

std::vector<float> coord_max(128);
const int parallelism = 16;
std::vector<std::mutex> coord_mutex(parallelism);
#pragma omp parallel for
for (int i = 0; i < limit; ++i) {
    int j = get_coord(i); // can return any value in range [0,128)
    float x = compute_value(j, i);
    std::scoped_lock lock(coord_mutex[j % parallelism]); // lock striping: element j maps to mutex j % 16
    coord_max[j] = std::max(coord_max[j], x);
}

Upvotes: 1

Rahul Banerjee

Reputation: 2363

Not sure about the syntax, but algorithmically, you have three choices:

  1. Lock down the entire vector to guarantee atomic access (which is what you are currently doing).

  2. Lock down individual elements, so that every element can be updated independently of the others. Pros: maximum parallelism; cons: lots of locks required!

  3. Something in between! Conceptually, think of partitioning your vector into 16 (or 32/64/...) "banks": bank 0 consists of vector elements 0, 16, 32, 48, 64, ...; bank 1 consists of elements 1, 17, 33, 49, 65, ...; bank 2 consists of elements 2, 18, 34, 50, 66, ...; and so on. Now use 16 explicit locks, one per bank, and you can have up to 16-way parallelism. To access element n, acquire lock (n % 16), finish the access, then release the same lock; see the sketch below.
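A minimal sketch of option 3 using OpenMP's own lock API, so it stays C++03-friendly (kBanks and update_max are illustrative names, and the locks are assumed to be initialized once before the parallel region):

#include <omp.h>
#include <algorithm>
#include <vector>

const int kBanks = 16;
std::vector<float> coord_max(128);
omp_lock_t bank_locks[kBanks];

// Once, before the parallel region:
//   for (int b = 0; b < kBanks; ++b) omp_init_lock(&bank_locks[b]);

void update_max(int j, float x) {
    omp_set_lock(&bank_locks[j % kBanks]);   // acquire lock (j % 16)
    coord_max[j] = std::max(coord_max[j], x);
    omp_unset_lock(&bank_locks[j % kBanks]); // release the same lock
}

Reunanen's answer above shows the same idea with std::mutex.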

Upvotes: 1
