MetallicPriest

Reputation: 30765

Is this approach of barriers right?

I have found that pthread_barrier_wait is quite slow, so at one place in my code I replaced pthread_barrier_wait with my version of a barrier (my_barrier), which uses an atomic variable. I found it to be much faster than pthread_barrier_wait. Is there any flaw in using this approach? Is it correct? Also, I don't know why it is faster than pthread_barrier_wait. Any clue?
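For reference, the asker's barrier code is not shown here. A minimal sketch of the kind of atomic spinning barrier being described, reconstructed from the answers below (the counter name and the NUM_OF_THREADS value are assumptions), might look like this:

// Hypothetical reconstruction, not the asker's actual code: a spinning barrier
// built on one atomic counter (fetch-and-add plus a yielding spin loop).
#include <atomic>
#include <sched.h>

#define NUM_OF_THREADS 4                     // assumed value

std::atomic<int> thread_counter(0);          // assumed name

void my_barrier()
{
    thread_counter++;                        // atomic increment (lock xadd)
    while (thread_counter % NUM_OF_THREADS)  // spin until the count hits a multiple of the thread count
        sched_yield();                       // give up the CPU between checks
}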

Upvotes: 9

Views: 1162

Answers (3)

R.. GitHub STOP HELPING ICE

Reputation: 215259

Your barrier implementation does not work, at least not if the barrier will be used more than once. Consider this case:

  1. NUM_OF_THREADS-1 threads are waiting at the barrier, spinning.
  2. Last thread arrives and passes through the barrier.
  3. Last thread exits barrier, continues processing, finishes its next task, and reenters the barrier wait.
  4. Only now do the other waiting threads get scheduled, and they can't exit the barrier because the counter was incremented again. Deadlock.
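
A standard fix for this reuse race (not part of this answer, just an illustration) is a generation-counting or "sense-reversal" barrier, where waiters spin on a generation number instead of on the count itself, so the last arrival can safely reset the count for the next round:

// Minimal C++11 sketch of a reusable spinning barrier; the class name and the
// yield choice are illustrative, not taken from the question or this answer.
#include <atomic>
#include <thread>

class SpinBarrier {
    const int nthreads;
    std::atomic<int> waiting{0};
    std::atomic<int> generation{0};
public:
    explicit SpinBarrier(int n) : nthreads(n) {}

    void wait() {
        int gen = generation.load();                 // remember the current round
        if (waiting.fetch_add(1) + 1 == nthreads) {
            waiting.store(0);                        // last arrival resets the count...
            generation.fetch_add(1);                 // ...and releases this generation
        } else {
            while (generation.load() == gen)         // spin until the round is over
                std::this_thread::yield();
        }
    }
};

Threads that are slow to wake only ever read the generation counter, so a fast thread re-entering the barrier for the next round cannot strand them.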

In addition, one often-overlooked but nasty issue when using dynamically allocated barriers is destroying/freeing them. You'd like any one of the threads to be able to perform the destroy/free after the barrier wait returns, as long as you know nobody will be trying to wait on it again, but this requires making sure all waiters have finished touching memory in the barrier object before any waiters wake up - not an easy problem to solve. See my past questions on implementing barriers...

How can barriers be destroyable as soon as pthread_barrier_wait returns?

Can a correct fail-safe process-shared barrier be implemented on Linux?
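
For context, the pattern those two questions are about looks roughly like this (a sketch, not code from this answer): POSIX hands exactly one waiter the PTHREAD_BARRIER_SERIAL_THREAD return value, and the natural thing is to let that thread do the destroy/free; whether an implementation actually makes that safe is exactly the subtle point raised above.

// Sketch of the "one thread destroys after the wait" pattern; last_use() is a
// hypothetical helper, and the barrier is assumed to have been malloc'd.
#include <pthread.h>
#include <stdlib.h>

void last_use(pthread_barrier_t *bar)
{
    int rc = pthread_barrier_wait(bar);
    if (rc == PTHREAD_BARRIER_SERIAL_THREAD) {   // exactly one thread sees this
        pthread_barrier_destroy(bar);
        free(bar);
    }
}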

And unless you know you have a special case where none of the difficult problems apply, don't try implementing your own for an application.

Upvotes: 8

Grizzly

Reputation: 20191

Your barrier should be correct from what I can see, as long as you don't use the barrier too often or your thread number is a power of two (so that the modulo test still works when the integer wraps around). Theoretically your atomic will overflow somewhere (after hundreds of millions of uses for typical core counts, but still), so you might want to add some functionality to reset it somewhere.

Now to why it is faster: I'm not entirely sure, but I think pthread_barrier_wait will let the thread sleep till it is time to wake up. Yours is spinning on the condition, yielding in each iteration. However, if there is no other application or thread which needs the processing time, the thread will likely be scheduled again directly after the yield, so the wait time is shorter. At least that's what playing around with that kind of barrier seemed to indicate on my system.

As a side note: since you use atomic<int> I assume you use C++11. Wouldn't it make sense to use std::this_thread::yield() instead of sched_yield() in that case to remove the dependency on pthreads?
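
For instance (assuming a spin loop like the one sketched in the question above), the only change would be the yield call:

// Same assumed barrier, with the pthread/sched.h dependency removed.
#include <atomic>
#include <thread>

#define NUM_OF_THREADS 4                         // assumed value

std::atomic<int> thread_counter(0);              // assumed name

void my_barrier()
{
    thread_counter++;
    while (thread_counter % NUM_OF_THREADS)
        std::this_thread::yield();               // C++11 replacement for sched_yield()
}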

This link might also be interesting for you; it measures the performance of various barrier implementations (yours is roughly the lock xadd + while(i < NCPU) case, except for the yielding).

Upvotes: 2

Pierre Habouzit

Reputation: 698

AFAICT it's correct, and it looks like it's faster, but in the highly contended case it'll be a lot worse. The highly contended case being when you have lots of threads, way more than CPUs.

There's a way to make fast barriers though, using eventcounts (look the term up on Google).

struct barrier {
    atomic<int>       count;   /* threads still expected at the barrier */
    struct eventcount ec;      /* eventcount the waiters sleep on */
};

void my_barrier_wait(struct barrier *b)
{
    eventcount_key_t key;

    if (--b->count == 0) {
        /* last thread to arrive: wake everyone sleeping on the eventcount */
        eventcount_broadcast(&b->ec);
        return;
    }
    for (;;) {
        /* capture the eventcount key before re-checking the condition */
        key = eventcount_get(&b->ec);
        if (!b->count)
            return;
        /* sleep; passing the key means a broadcast that happened between
         * eventcount_get() and here is not missed */
        eventcount_wait(&b->ec, key);
    }
}

This should scale way better.

Though frankly, when you use barriers, I don't think performance matters much; it's not supposed to be an operation that needs to be fast, and this looks a lot like premature optimization.

Upvotes: 3
