Reputation: 77
I created a very simple app to figure out how boost::thread works, and I found the result of this test surprising: 4 threads of execution finish the computation only 2 times faster than 1 thread, where I expected a 4x boost. Another question: why did 8 threads not bring any further performance gain?
I'm using Boost 1.46.1 and VS2008. The full source code is below. The program was run on a Core i5 750 machine.
#include <iostream>
#include <vector>
#include <cmath>
#include <cassert>
#include <windows.h>        // for GetTickCount (Windows-specific)
#include <boost/thread.hpp>
#include <boost/timer.hpp>

typedef unsigned int uint;

struct Vector {
    float x, y, z;
    Vector() : x(0.f), y(0.f), z(0.f) {}
    float len() {
        return sqrtf(x*x + y*y + z*z);
    }
};

float norm(int a) {
    return float((a % 10) + 1) / 10.f;
}

void genVectors(std::vector<Vector>& examples) {
    srand(GetTickCount());
    for (uint i = 0; i < examples.size(); ++i) {
        examples[i].x = norm(rand());
        examples[i].y = norm(rand());
        examples[i].z = norm(rand());
    }
}

typedef std::vector<Vector> Data;
typedef Data::iterator DataIter;
typedef std::vector<float> Result;
typedef Result::iterator ResultIter;

struct Worker {
    Data data;
    Result result;
    // Iterators are taken by const reference so that temporaries like
    // a.begin() + offset bind without relying on an MSVC extension.
    Worker(const DataIter& dataStart,
           const DataIter& dataEnd,
           const ResultIter& resultStart,
           const ResultIter& resultEnd)
        : data(dataStart, dataEnd), result(resultStart, resultEnd) {
        assert(data.size() == result.size());
    }
    void operator()() {
        DataIter di = data.begin();
        ResultIter ri = result.begin();
        const DataIter dend = data.end();
        for (; di != dend; ++di, ++ri) {
            *ri = di->len();
        }
    }
};

int main(int argc, char **argv) {
    const uint numThreads = 4;
    const uint seqLen = 13107200;

    std::vector<Vector> a;
    a.resize(seqLen);
    genVectors(a);

    std::vector<float> singleThreadResult(a.size());
    assert(a.size() == singleThreadResult.size());

    boost::timer singleThreadTimer;
    for (uint i = 0; i < a.size(); ++i) {
        singleThreadResult[i] = a[i].len();
    }
    double singleThreadTime = singleThreadTimer.elapsed();

    std::vector<float> multiThreadResult(a.size());
    Worker* workers[numThreads];
    for (uint i = 0; i < numThreads; ++i) {
        uint chunkSize = seqLen / numThreads;
        assert(numThreads * chunkSize == seqLen);
        workers[i] = new Worker(a.begin() + i*chunkSize,
                                a.begin() + (i+1)*chunkSize,
                                multiThreadResult.begin() + i*chunkSize,
                                multiThreadResult.begin() + (i+1)*chunkSize);
    }

    boost::timer multiThreadTimer;
    boost::thread_group threads;
    for (uint i = 0; i < numThreads; ++i) {
        threads.create_thread(boost::ref(*workers[i]));
    }
    threads.join_all();
    double multiThreadTime = multiThreadTimer.elapsed();

    for (uint i = 0; i < numThreads; ++i) {
        delete workers[i];
    }

    using namespace std;
    cout << "Single thread time: " << singleThreadTime << endl;
    cout << numThreads << " threads time: " << multiThreadTime << endl;
    return 0;
}
Upvotes: 2
Views: 1831
Reputation: 4667
When you use system threads, there is no guarantee that every thread will run on a separate core. Scheduling threads onto cores is the OS's job (unless you set thread affinity explicitly), and with 4 threads in your application the OS may well run them all on a single core, depending on the overall CPU load and a gazillion other factors.
Upvotes: 0
Reputation: 393709
For a scenario like this, I'd by all means prefer OpenMP's #pragma omp parallel for, or simply build with gcc -fopenmp -D_GLIBCXX_PARALLEL and (likely) get automatic parallelization...
Upvotes: 0
Reputation:
I've always found that it is impossible to predict the "best" number of threads for a given problem running on a given hardware configuration. My approach has been to parameterise the number of threads from the command line, and try various numbers until I hit the "sweet spot".
Upvotes: 0
Reputation: 7954
According to the Intel website, the Core i5 750 processor has 4 cores and supports 4 threads, so you shouldn't expect any more performance from 8 threads than from 4. By adding more threads to your software than you have processors (or hyperthreads) you are just adding more context switch overhead.
As to why 4 threads are no faster than 2, I would guess it has to do with the size of the working set. The data set is much bigger than the 8MB cache, so your test application is probably limited by memory bandwidth rather than by the CPU.
To test this out, try benchmarking with a data set which fits in the cache.
Upvotes: 2
Reputation: 16769
You may have 4 cores in your Core i5 750 machine, but you still have a single memory bus. All of the input data (13107200 * 3 * sizeof(float) = 157 MB) must pass through that bus, and then there is the result vector of a "mere" 13107200 * sizeof(float) = 52 MB, which travels over the same resource. All this is heavy on the cache, and the 4 cores spend a lot of time waiting for memory to become available for either reads or writes.
Upvotes: 1