Matei David

Reputation: 2362

How does OpenMP implement access to critical sections?

I want to read an input file (in C/C++) and process each line independently as fast as possible. The processing takes a few ticks itself, so I decided to use OpenMP threads. I have this code:

#pragma omp parallel num_threads(num_threads)
{
  string line;
  bool ok;
  while (true) {
#pragma omp critical(input)
    {
      // read and test the stream inside the critical section, so another
      // thread's getline cannot race with the end-of-file check
      ok = static_cast<bool>(getline(f, line));
    }
    if (!ok)
      break;
    process_line(line);
  }
}

My question is, how do I determine the optimal number of threads to use? Ideally, I would like this to be dynamically detected at runtime. I don't understand the DYNAMIC schedule option for parallel, so I can't say if that would help. Any insights?
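For reference, the closest thing to runtime detection I can think of is asking OpenMP how many processors it sees and allowing an explicit override; a rough sketch of what I mean (MY_PROG_THREADS is just a name I made up):

#include <omp.h>
#include <cstdlib>

// Sketch only: pick a default thread count at runtime, overridable via an
// environment variable (MY_PROG_THREADS is a made-up name).
int choose_num_threads()
{
  if (const char* env = std::getenv("MY_PROG_THREADS"))
    return std::atoi(env);
  return omp_get_num_procs();  // number of processors visible to OpenMP
}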

Also, I'm not sure how to determine the optimal number "by hand". I tried various numbers for my specific application. I would have thought the CPU usage reported by top would help, but it doesn't(!) In my case, the CPU usage stays consistently at around num_threads*(85-95). However, using pv to observe the speed at which I'm reading input, I noted that the optimal number is around 2-5; above that, the input rate drops. So my question is: why would I then see a CPU usage of 850 when using 10 threads? Can this be due to some inefficiency in how OpenMP handles threads waiting to get into the critical section?

EDIT: Here are some timings. I obtained them with:

for NCPU in $(seq 1 20) ; do echo "NCPU=$NCPU" ; { pv -f -a my_input.gz | pigz -d -p 20 | { { sleep 60 ; PID=$(ps gx -o pid,comm | grep my_prog | sed "s/^ *//" | cut -d " " -f 1) ; USAGE=$(ps h -o "%cpu" $PID) ; kill -9 $PID ; sleep 1 ; echo "usage: $USAGE" >&2 ; } & cat ; } | ./my_prog -N $NCPU >/dev/null 2>/dev/null ; sleep 2 ; } 2>&1 | grep -v Killed ; done

NCPU=1  [8.27MB/s] usage: 98.4
NCPU=2  [12.5MB/s] usage: 196
NCPU=3  [18.4MB/s] usage: 294
NCPU=4  [23.6MB/s] usage: 393
NCPU=5  [28.9MB/s] usage: 491
NCPU=6  [33.7MB/s] usage: 589
NCPU=7  [37.4MB/s] usage: 688
NCPU=8  [40.3MB/s] usage: 785
NCPU=9  [41.9MB/s] usage: 884
NCPU=10 [41.3MB/s] usage: 979
NCPU=11 [41.5MB/s] usage: 1077
NCPU=12 [42.5MB/s] usage: 1176
NCPU=13 [41.6MB/s] usage: 1272
NCPU=14 [42.6MB/s] usage: 1370
NCPU=15 [41.8MB/s] usage: 1493
NCPU=16 [40.7MB/s] usage: 1593
NCPU=17 [40.8MB/s] usage: 1662
NCPU=18 [39.3MB/s] usage: 1763
NCPU=19 [38.9MB/s] usage: 1857
NCPU=20 [37.7MB/s] usage: 1957

My problem is that I can achieve 40MB/s with 785 CPU usage, but also with 1662 CPU usage. Where do those extra cycles go?

EDIT2: Thanks to Lirik and John Dibling, I now understand that the reason I find the timings above puzzling has nothing to do with I/O, but rather with the way OpenMP implements critical sections. My intuition is that if one thread is inside a CS and 10 threads are waiting to get in, then the moment the first thread exits the CS, the kernel should wake up exactly one other thread and let it in. The timings suggest otherwise: could it be that the waiting threads wake up many times on their own, only to find the CS still occupied? Is this an issue with the threading library or with the kernel?
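One experiment I have in mind (just an untested sketch of my own) is to time how long each thread spends getting through the critical section versus inside process_line; if the waiting threads really are spinning, the wait total should blow up as I add threads:

double wait_time = 0.0, work_time = 0.0;
#pragma omp parallel num_threads(num_threads) reduction(+:wait_time,work_time)
{
  string line;
  bool ok = true;
  while (ok) {
    double t0 = omp_get_wtime();
#pragma omp critical(input)
    {
      ok = static_cast<bool>(getline(f, line));
    }
    double t1 = omp_get_wtime();
    wait_time += t1 - t0;   // includes time spent contending for the critical section
    if (ok) {
      process_line(line);
      work_time += omp_get_wtime() - t1;
    }
  }
}
// after the parallel region: compare the total wait_time against work_time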

Upvotes: 1

Views: 1209

Answers (2)

Kiril

Reputation: 40375

"I want to read an input file (in C/C++) and process each line independently as fast as possible."

Reading from a file makes your application I/O bound, so the best you can do for the reading portion alone is to read at the maximum disk speed (on my machine that takes less than 10% of CPU time). This means that even if you completely freed the reading thread from any processing, keeping up with the disk would require the processing to fit within the remaining CPU time (90% on my machine). If the line-processing threads take up more than that remaining CPU time, then you will not be able to keep up with the hard drive.

There are several options in that case:

  1. Queue up the input and let the processing threads dequeue "work" until they've caught up with the input that was presented (given that you have enough RAM to do so).
  2. Open enough threads and just max out your processor until all the data is read, which is your best effort scenario.
  3. Throttle reading/processing so that you don't take up all of the system resources (in case you're worried about UI responsiveness and/or user experience).

"...the processing takes a few ticks itself, so I decided to use OpenMP threads"

This is a good sign, but it also means that your CPU utilization will not be very high. This is the part where you can optimize your performance, and it's probably best to do it by hand, as John Dibling mentioned. In general, it's best if you queue up each line and let the processing threads pull processing requests from the queue until you have nothing more to process. The latter is also known as a Producer/Consumer design pattern, a very common pattern in concurrent computing.
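To make the Producer/Consumer idea concrete, here is a minimal sketch using plain C++11 threads and a mutex/condition-variable queue rather than OpenMP (the names and the queue bound are mine, and process_line is assumed to exist as in your code):

#include <condition_variable>
#include <istream>
#include <mutex>
#include <queue>
#include <string>

void process_line(const std::string& line);   // assumed to exist, as in your code

std::queue<std::string> q;
std::mutex m;
std::condition_variable not_empty, not_full;
bool done = false;
const std::size_t max_queued = 1024;          // arbitrary bound to cap memory use

// Producer: the single thread that reads the input and fills the queue.
void producer(std::istream& in)
{
  std::string line;
  while (std::getline(in, line)) {
    std::unique_lock<std::mutex> lk(m);
    not_full.wait(lk, [] { return q.size() < max_queued; });
    q.push(std::move(line));
    not_empty.notify_one();
  }
  std::lock_guard<std::mutex> lk(m);
  done = true;
  not_empty.notify_all();
}

// Consumers: worker threads that dequeue lines and process them.
void consumer()
{
  for (;;) {
    std::unique_lock<std::mutex> lk(m);
    not_empty.wait(lk, [] { return !q.empty() || done; });
    if (q.empty())
      return;                                 // done and fully drained
    std::string line = std::move(q.front());
    q.pop();
    not_full.notify_one();
    lk.unlock();
    process_line(line);                       // heavy work happens outside the lock
  }
}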

Update

Why would there be a difference between

  • (i) each process: get lock, pull data, release lock, process data; and
  • (ii) one process: pull data, get lock, enqueue chunk, release lock,
  • others: get lock, dequeue chunk, release lock, process data?

There is very little difference: in a way, both represent a consumer/producer pattern. In the first case (i) you don't have an actual queue, but you could consider the file stream to be your Producer (queue) and the Consumer is the thread that reads from the stream. In the second case (ii) you're explicitly implementing the consumer/producer pattern, which is more robust, reusable and provides better abstraction for the Producer. If you ever decide to use more than one "input channel," then the latter case is better.

Finally (and probably most importantly), you can use a lock-free queue with a single producer and a single consumer, which will make (ii) a lot faster than (i) in terms of getting you to the point of being I/O bound. With a lock-free queue you can pull data, enqueue a chunk and dequeue a chunk without locking.
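For illustration, here is a rough sketch of what such a single-producer/single-consumer queue typically looks like: a fixed-size ring buffer with two atomic indices (the names and memory orderings are my own, not taken from any particular library):

#include <atomic>
#include <cstddef>

template <typename T, std::size_t N>
class SpscQueue {
  T buf_[N];
  std::atomic<std::size_t> head_{0};  // next slot to pop; advanced by the consumer only
  std::atomic<std::size_t> tail_{0};  // next slot to push; advanced by the producer only
public:
  // Called by the single producer thread only.
  bool push(T v) {
    std::size_t t = tail_.load(std::memory_order_relaxed);
    std::size_t next = (t + 1) % N;
    if (next == head_.load(std::memory_order_acquire))
      return false;                   // queue is full
    buf_[t] = std::move(v);
    tail_.store(next, std::memory_order_release);
    return true;
  }
  // Called by the single consumer thread only.
  bool pop(T& out) {
    std::size_t h = head_.load(std::memory_order_relaxed);
    if (h == tail_.load(std::memory_order_acquire))
      return false;                   // queue is empty
    out = std::move(buf_[h]);
    head_.store((h + 1) % N, std::memory_order_release);
    return true;
  }
};

// e.g. SpscQueue<std::string, 1024> line_queue;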

Upvotes: 2

John Dibling

Reputation: 101476

The best you can hope to do is tune it yourself, by hand, through repeated measure-adjust-compare cycles.

The optimal number of threads to use to process a dataset is highly dependent on many factors, not the least of which are:

  1. The data itself
  2. The algorithm you use to process it
  3. The CPU(s) the threads are running on
  4. The operating system

You could try to design some kind of heuristic that measures the throughput of your processors and adjusts it on the fly, but this kind of thing tends to be way more trouble than it's worth.

As a rule, for tasks that are I/O bound I generally start with about 12 threads per core and tune from there. For tasks that are CPU bound, I'd start with about 4 threads per core and go from there. The key is the "going from there" part if you really want to optimize your processor use.

Also keep in mind that you should make this setting configurable if you really want to optimize, because every system on which this is deployed will have different characteristics.
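For example, here is a sketch of what I mean by configurable, using my rule-of-thumb starting points as defaults and the -N flag from your command line as the override (the helper name is mine):

#include <omp.h>
#include <cstdlib>
#include <cstring>

// Sketch: make the thread count a command-line setting, defaulting to a
// rule-of-thumb multiple of the core count (the multipliers are my usual
// starting points, not measured optima -- tune from there).
int pick_threads(int argc, char** argv, bool io_bound)
{
  for (int i = 1; i + 1 < argc; ++i)
    if (std::strcmp(argv[i], "-N") == 0)      // explicit override wins
      return std::atoi(argv[i + 1]);
  int per_core = io_bound ? 12 : 4;           // starting points from the rule above
  return per_core * omp_get_num_procs();
}

You would then pass the result to omp_set_num_threads() (or to your num_threads clause) and tune from there.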

Upvotes: 1
