Reputation: 98
I've written a program that searches for the maximum element of arrays using C++0x threads (for learning purposes). The implementation uses the standard thread and future classes. However, the parallelized function consistently shows the same or worse run time than the non-parallelized one.
The code is below. I tried storing the data in a one-dimensional array, a multi-dimensional array, and finally in several separate arrays, but no option gave good results. I tried compiling and running my code from Eclipse and from the command line, still with no success. I also tried a similar test without arrays; parallelization gave only a 20% speedup there. From my point of view, I'm running a very simple parallel program, with no locks and almost no resource sharing (each thread operates on its own array). What is the bottleneck?
My machine has an Intel Core i7 processor at 2.2 GHz with 8 GB of RAM, running Ubuntu 12.04.
#include <algorithm>
#include <cstdlib>
#include <future>
#include <iostream>
#include <sys/time.h>

using namespace std;

const int n = 100000000;
int a[n], b[n], c[n], d[n];

int find_max_usual() {
    int res = 0;
    for (int i = 0; i < n; ++i) {
        res = max(res, a[i]);
        res = max(res, b[i]);
        res = max(res, c[i]);
        res = max(res, d[i]);
    }
    return res;
}

int find_max(int *a) {
    int res = 0;
    for (int i = 0; i < n; ++i)
        res = max(res, a[i]);
    return res;
}

int find_max_parallel() {
    future<int> res_a = async(launch::async, find_max, a);
    future<int> res_b = async(launch::async, find_max, b);
    future<int> res_c = async(launch::async, find_max, c);
    future<int> res_d = async(launch::async, find_max, d);
    int res = max(max(res_a.get(), res_b.get()), max(res_c.get(), res_d.get()));
    return res;
}

double get_time() {
    timeval tim;
    gettimeofday(&tim, NULL);
    double t = tim.tv_sec + (tim.tv_usec / 1000000.0);
    return t;
}

int main() {
    for (int i = 0; i < n; ++i) {
        a[i] = rand();
        b[i] = rand();
        c[i] = rand();
        d[i] = rand();
    }
    double start = get_time();
    int x = find_max_usual();
    cerr << x << " " << get_time() - start << endl;
    start = get_time();
    x = find_max_parallel();
    cerr << x << " " << get_time() - start << endl;
    return 0;
}
Timing showed that almost all the time in find_max_parallel is spent in
int res = max(max(res_a.get(), res_b.get()), max(res_c.get(), res_d.get()));
Compilation command line:
g++ -O3 -std=c++0x -pthread x.cpp
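As an aside, the gettimeofday-based timer above can be replaced with a portable C++11 one. A minimal sketch using std::chrono::steady_clock (a drop-in replacement for the get_time above; steady_clock is also monotonic, unlike gettimeofday):

```cpp
#include <cassert>
#include <chrono>

// Returns a timestamp in seconds, suitable for measuring intervals
// by subtracting two successive calls.
double get_time() {
    using namespace std::chrono;
    return duration_cast<duration<double>>(
        steady_clock::now().time_since_epoch()).count();
}
```

Because steady_clock never goes backwards, interval measurements are immune to system clock adjustments (e.g. NTP), which can distort gettimeofday-based timings.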
Update. The problem is solved. I got the desired results with the same test: 4 threads give about a 3.3x speedup, 3 threads give about a 2.5x speedup, and 2 threads behave almost ideally with a 1.9x speedup. I had just rebooted the system with some new updates. I haven't seen any significant difference in CPU load or running programs.
Thanks to all for help.
Upvotes: 8
Views: 1117
Reputation: 31
I just ran the same test with gcc 4.7.1 and the threaded version is roughly 4 times faster (on a 4-core server). So the problem is obviously not in the std::future implementation, but in threading settings that are not optimal for your environment. As noted above, your test is not CPU-intensive but memory-intensive, so the bottleneck is definitely memory access. You'd probably want to run some CPU-intensive test (like computing pi to high precision) to benchmark threading properly.
Without experimenting with different numbers of threads and different array sizes, it's hard to say where exactly the bottleneck is, but there are probably a few things in play:

- You probably have a 2-channel memory controller (it's either 2 or 3), so going above 2 threads will just introduce additional contention over memory access. Thus your thesis about having no locking and no resource sharing is not correct: at the hardware level there is contention over concurrent memory access.

- The non-parallel version will be efficiently optimized by prefetching data into the cache. In the parallel version, on the other hand, there's a chance you end up with intensive context switching, and as a result thrash the CPU cache.
For both factors, you are likely to see a speedup if you tune the number of threads down to 2.
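A minimal sketch of what a two-thread variant could look like (the particular split of the four arrays into a/b and c/d pairs is my assumption, not something from the original code):

```cpp
#include <algorithm>
#include <cassert>
#include <future>

// Scan two arrays in a single thread, so at most two threads
// compete for memory bandwidth at any time.
int find_max_pair(const int *x, const int *y, int n) {
    int res = 0;
    for (int i = 0; i < n; ++i) {
        res = std::max(res, x[i]);
        res = std::max(res, y[i]);
    }
    return res;
}

// Two-thread version of find_max_parallel: each future handles
// two of the four arrays instead of one.
int find_max_two_threads(const int *a, const int *b,
                         const int *c, const int *d, int n) {
    auto res_ab = std::async(std::launch::async, find_max_pair, a, b, n);
    auto res_cd = std::async(std::launch::async, find_max_pair, c, d, n);
    return std::max(res_ab.get(), res_cd.get());
}
```

Whether this actually beats the four-thread version depends on the memory controller and core count, so it is worth timing both.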
Upvotes: 3
Reputation: 34518
You have to explicitly set std::launch::async.

future<int> res_c = async(std::launch::async, find_max, c);

If you omit the flag, std::launch::async | std::launch::deferred is assumed, which leaves it up to the implementation to choose whether to start the task asynchronously or deferred.

Current versions of gcc use std::launch::deferred; MSVC has a runtime scheduler which decides at runtime how the task should be run.

Also note that if you try

std::async(find_max, c);

this will also block, because the destructor of std::future waits for the task to finish.
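A small sketch contrasting the two cases (the answer() function is a made-up placeholder; with a default-policy gcc of that era the second variant runs the task lazily, on the calling thread, only when get() is invoked):

```cpp
#include <cassert>
#include <future>

int answer() { return 42; }

// Forces the task to start eagerly on a separate thread.
int run_async() {
    std::future<int> f = std::async(std::launch::async, answer);
    return f.get();
}

// Default policy: async | deferred. The implementation may defer,
// in which case the work happens inside this get() call instead
// of in parallel -- which is why the original benchmark saw no speedup.
int run_default() {
    std::future<int> f = std::async(answer);
    return f.get();
}
```

Both calls return the same result; only the timing and the thread doing the work differ, which is invisible to correctness tests but decisive for a benchmark.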
Upvotes: 14