Reputation: 2362
I want to read an input file (in C/C++) and process each line independently as fast as possible. The processing takes a few ticks itself, so I decided to use OpenMP threads. I have this code:
#pragma omp parallel num_threads(num_threads)
{
    string line;
    bool got_line;
    while (true) {
        #pragma omp critical(input)
        {
            // Test for success inside the critical section: checking
            // f.eof() outside it races with the other reader threads.
            got_line = static_cast<bool>(getline(f, line));
        }
        if (!got_line)
            break;
        process_line(line);
    }
}
My question is, how do I determine the optimal number of threads to use? Ideally, I would like this to be detected dynamically at runtime. I don't understand the DYNAMIC schedule option for parallel, so I can't say if that would help. Any insights?
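(For reference: schedule(dynamic) belongs to the worksharing for construct rather than to parallel itself, so it only applies once the work is an indexed loop. A minimal sketch of what that would look like here, assuming a hypothetical read_all_lines helper that slurps the file into memory first:)

#include <string>
#include <vector>

// schedule(dynamic) hands loop iterations to threads on demand, which
// suits variable-cost work such as line processing. It requires the work
// to be an indexed loop, so this assumes all lines were read up front;
// read_all_lines is a hypothetical helper, not from the original code.
std::vector<std::string> lines = read_all_lines(f);
#pragma omp parallel for schedule(dynamic) num_threads(num_threads)
for (long i = 0; i < (long)lines.size(); ++i)
    process_line(lines[i]);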
Also, I'm not sure how to determine the optimal number "by hand". I tried various numbers for my specific application. I would have thought the CPU usage reported by top would help, but it doesn't(!) In my case, the CPU usage stays consistently at around num_threads*(85-95). However, using pv to observe the speed at which I'm reading input, I noted that the optimal number is around 2-5; above that, the input speed drops. So my question is: why would I then see a CPU usage of 850 when using 10 threads? Can this be due to some inefficiency in how OpenMP handles threads waiting to get into the critical section?
EDIT: Here are some timings. I obtained them with:
for NCPU in $(seq 1 20) ; do
  echo "NCPU=$NCPU"
  {
    pv -f -a my_input.gz | pigz -d -p 20 | {
      {
        sleep 60
        PID=$(ps gx -o pid,comm | grep my_prog | sed "s/^ *//" | cut -d " " -f 1)
        USAGE=$(ps h -o "%cpu" $PID)
        kill -9 $PID
        sleep 1
        echo "usage: $USAGE" >&2
      } &
      cat
    } | ./my_prog -N $NCPU >/dev/null 2>/dev/null
    sleep 2
  } 2>&1 | grep -v Killed
done
NCPU=1 [8.27MB/s] usage: 98.4
NCPU=2 [12.5MB/s] usage: 196
NCPU=3 [18.4MB/s] usage: 294
NCPU=4 [23.6MB/s] usage: 393
NCPU=5 [28.9MB/s] usage: 491
NCPU=6 [33.7MB/s] usage: 589
NCPU=7 [37.4MB/s] usage: 688
NCPU=8 [40.3MB/s] usage: 785
NCPU=9 [41.9MB/s] usage: 884
NCPU=10 [41.3MB/s] usage: 979
NCPU=11 [41.5MB/s] usage: 1077
NCPU=12 [42.5MB/s] usage: 1176
NCPU=13 [41.6MB/s] usage: 1272
NCPU=14 [42.6MB/s] usage: 1370
NCPU=15 [41.8MB/s] usage: 1493
NCPU=16 [40.7MB/s] usage: 1593
NCPU=17 [40.8MB/s] usage: 1662
NCPU=18 [39.3MB/s] usage: 1763
NCPU=19 [38.9MB/s] usage: 1857
NCPU=20 [37.7MB/s] usage: 1957
My problem is that I can achieve ~40MB/s with 785% CPU usage, but also with 1662% CPU usage. Where do those extra cycles go?
EDIT2: Thanks to Lirik and John Dibling, I now understand that the reason I find the timings above puzzling has nothing to do with I/O, but rather with the way OpenMP implements critical sections. My intuition is that if one thread is inside a CS and 10 threads are waiting to get in, then the moment the first thread exits the CS, the kernel should wake up exactly one other thread and let it in. The timings suggest otherwise: can it be that the threads wake up many times on their own, only to find the CS occupied? Is this an issue with the threading library or with the kernel?
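One experiment that could narrow this down (a sketch, assuming the same f and process_line as above): replace the critical section with an explicit omp_lock_t, and run with OMP_WAIT_POLICY=passive, which asks the runtime to put waiting threads to sleep instead of letting them spin.

#include <fstream>
#include <string>
#include <omp.h>

// Sketch: the same read loop, but with an explicit OpenMP lock instead
// of a critical section. Whether waiting threads spin or sleep is up to
// the runtime; OMP_WAIT_POLICY=passive asks them to sleep.
void process_file(std::ifstream& f, int num_threads) {
    omp_lock_t input_lock;
    omp_init_lock(&input_lock);
    #pragma omp parallel num_threads(num_threads)
    {
        std::string line;
        bool got_line;
        while (true) {
            omp_set_lock(&input_lock);
            got_line = static_cast<bool>(std::getline(f, line));
            omp_unset_lock(&input_lock);
            if (!got_line)
                break;
            process_line(line);  // assumed from the question
        }
    }
    omp_destroy_lock(&input_lock);
}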
Upvotes: 1
Views: 1209
Reputation: 40375
"I want to read an input file (in C/C++) and process each line independently as fast as possible."
Reading from a file makes your application I/O bound, so the maximum performance you can achieve for the reading portion alone is to read at the maximum disk speed (on my machine that takes less than 10% CPU time). What this means is that if you could completely free the reading thread from any processing, the processing would have to take less than the remaining CPU time (90% on my machine). If the line-processing threads take up more than the remaining CPU time, then you will not be able to keep up with the hard drive.
There are several options in that case; the main ones are discussed below.
"...the processing takes a few ticks itself, so I decided to use OpenMP threads"
This is a good sign, but it also means that your CPU utilization will not be very high. This is the part where you can optimize performance, and it's probably best to do it by hand, as John Dibling mentioned. In general, it's best to queue up each line and let the processing threads pull requests from the queue until there is nothing left to process. The latter is also known as the Producer/Consumer design pattern, a very common pattern in concurrent computing, sketched below.
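A minimal sketch of such a queue using C++11 primitives (the class and member names are illustrative, not from the original post):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

// Minimal bounded producer/consumer queue sketch (illustrative names).
// The reader thread push()es lines; worker threads pop() them.
class BoundedQueue {
    std::queue<std::string> q_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    const size_t cap_ = 1024;  // backpressure limit (arbitrary)
    bool done_ = false;
public:
    void push(std::string s) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&]{ return q_.size() < cap_; });
        q_.push(std::move(s));
        not_empty_.notify_one();
    }
    bool pop(std::string& out) {  // returns false when no work remains
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&]{ return !q_.empty() || done_; });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return true;
    }
    void close() {  // producer calls this at EOF
        std::lock_guard<std::mutex> lk(m_);
        done_ = true;
        not_empty_.notify_all();
    }
};

The reader thread calls push() for each line and close() at EOF; each worker loops on pop() until it returns false. The cap_ limit provides backpressure: if the workers fall behind, the reader blocks instead of buffering the whole file.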
Why would there be a difference between
- (i) each process get lock, pull data, release lock, process data; and
- (ii) one process: pull data, get lock, enqueue chunk, release lock,
- others: get lock, dequeue chunk, release lock, process data?
There is very little difference: in a way, both represent a consumer/producer pattern. In the first case (i) you don't have an actual queue, but you could consider the file stream to be your Producer (queue), and the Consumer is the thread that reads from the stream. In the second case (ii) you're explicitly implementing the consumer/producer pattern, which is more robust and reusable, and provides a better abstraction for the Producer. If you ever decide to use more than one "input channel," the latter case is better.
Finally (and probably most importantly), you can use a lock-free queue with a single producer and a single consumer, which will make (ii) a lot faster than (i) in terms of getting you I/O bound. With a lock-free queue you can pull data, enqueue chunks, and dequeue chunks without locking.
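For illustration, a minimal single-producer/single-consumer ring buffer along those lines (a sketch with made-up names; a production version would also pad head_ and tail_ onto separate cache lines to avoid false sharing):

#include <atomic>
#include <cstddef>
#include <utility>

// SPSC lock-free ring sketch: exactly one thread calls push(), exactly
// one calls pop(). Capacity N must be a power of two.
template <typename T, size_t N>
class SpscQueue {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    T buf_[N];
    std::atomic<size_t> head_{0};  // advanced by the consumer
    std::atomic<size_t> tail_{0};  // advanced by the producer
public:
    bool push(T v) {  // producer thread only; false if full
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N)
            return false;
        buf_[t & (N - 1)] = std::move(v);
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {  // consumer thread only; false if empty
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;
        out = std::move(buf_[h & (N - 1)]);
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
};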
Upvotes: 2
Reputation: 101476
The best you can hope to do is tune it yourself, by hand, through repeated measure-adjust-compare cycles.
The optimal number of threads to use to process a dataset is highly dependent on many factors, not least the ratio of I/O to computation in the work and the hardware you run it on.
You could try to design some kind of heuristic that measures the throughput of your processors and adjusts the thread count on the fly, but this kind of thing tends to be way more trouble than it's worth.
As a rule, for tasks that are I/O bound I generally start with about 12 threads per core and tune from there. For tasks that are CPU bound, I'd start with about 4 threads per core and go from there. The key is the "going from there" part if you really want to optimize your processor use.
Also keep in mind that you should make this setting configurable if you really want to optimize, because every system on which this is deployed will have different characteristics. A sketch of what that can look like follows.
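For example (a sketch with made-up names; choose_num_threads and the MY_PROG_THREADS variable are hypothetical conventions, not from the post):

#include <cstdlib>
#include <omp.h>

// Pick a default thread count from the hardware, but allow overrides,
// since the best value can only be found by measuring on each system.
int choose_num_threads(int argc, char** argv) {
    if (argc > 1)
        return std::atoi(argv[1]);                  // command-line override
    if (const char* env = std::getenv("MY_PROG_THREADS"))
        return std::atoi(env);                      // environment override
    return omp_get_num_procs();                     // hardware default
}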
Upvotes: 1