Reputation: 16484
So I have written some code to experiment with threads and do some testing.
The code should create some numbers and then find the mean of those numbers.
I think it is just easier to show you what I have so far. I was expecting with two threads that the code would run about 2 times as fast. Measuring it with a stopwatch I think it runs about 6 times slower! EDIT: Now using the computer and clock() function to tell the time.
void findmean(std::vector<double>*, std::size_t, std::size_t, double*);
int main(int argn, char** argv)
{
// Program entry point
std::cout << "Generating data..." << std::endl;
// Create a vector containing many variables
std::vector<double> data;
for(uint32_t i = 1; i <= 1024 * 1024 * 128; i ++) data.push_back(i);
// Calculate mean using 1 core
double mean = 0;
std::cout << "Calculating mean, 1 Thread..." << std::endl;
findmean(&data, 0, data.size(), &mean);
mean /= (double)data.size();
// Print result
std::cout << " Mean=" << mean << std::endl;
// Repeat, using two threads
std::vector<std::thread> thread;
std::vector<double> result;
result.push_back(0.0);
result.push_back(0.0);
std::cout << "Calculating mean, 2 Threads..." << std::endl;
// Run threads
uint32_t halfsize = data.size() / 2;
uint32_t A = 0;
uint32_t B, C, D;
// Split the data into two blocks
if(data.size() % 2 == 0)
{
B = C = D = halfsize;
}
else if(data.size() % 2 == 1)
{
B = C = halfsize;
D = hsz + 1;
}
// Run with two threads
thread.push_back(std::thread(findmean, &data, A, B, &(result[0])));
thread.push_back(std::thread(findmean, &data, C, D , &(result[1])));
// Join threads
thread[0].join();
thread[1].join();
// Calculate result
mean = result[0] + result[1];
mean /= (double)data.size();
// Print result
std::cout << " Mean=" << mean << std::endl;
// Return
return EXIT_SUCCESS;
}
void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
for(uint32_t i = 0; i < length; i ++) {
*result += (*datavec).at(start + i);
}
}
I don't think this code is exactly wonderful, if you could suggest ways of improving it then I would be grateful for that also.
Several people have suggested making a local variable for the function 'findmean'. This is what I have done:
void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
register double holding = *result;
for(uint32_t i = 0; i < length; i ++) {
holding += (*datavec).at(start + i);
}
*result = holding;
}
I can now report: The code runs with almost the same execution time as with a single thread. That is a big improvement of 6x, but surely there must be a way to make it nearly twice as fast?
I have set the optimization to 'O2' - I will create a table with the results.
Original Code with no optimization or register variable: 1 thread: 4.98 seconds, 2 threads: 29.59 seconds
Code with added register variable: 1 Thread: 4.76 seconds, 2 Threads: 4.76 seconds
With reg variable and -O2 optimization: 1 Thread: 0.43 seconds, 2 Threads: 0.6 seconds 2 Threads is now slower?
With Dameon's suggestion, which was to put a large block of memory in between the two result variables: 1 Thread: 0.42 seconds, 2 Threads: 0.64 seconds
With TAS 's suggestion of using iterators to access contents of the vector: 1 Thread: 0.38 seconds, 2 Threads: 0.56 seconds
Same as above on Core i7 920 (single channel memory 4GB): 1 Thread: 0.31 seconds, 2 Threads: 0.56 seconds
Same as above on Core i7 920 (dual channel memory 2x2GB): 1 Thread: 0.31 seconds, 2 Threads: 0.35 seconds
Upvotes: 13
Views: 10094
Reputation: 2079
As stated in other answers, you are seeing false sharing on the result variable, but there is also one other location where this is happening. The std::vector<T>::at()
function (as well as the std::vector<T>::operator[]()
) access the length of the vector on each element access. To avoid this you should switch to using iterators. Also, using std::accumulate()
will allow you to take advantage of optimizations in the standard library implementation you are using.
Here are the relevant parts of the code:
thread.push_back(std::thread(findmean, std::begin(data)+A, std::begin(data)+B, &(result[0])));
thread.push_back(std::thread(findmean, std::begin(data)+B, std::end(data), &(result[1])));
and
void findmean(std::vector<double>::const_iterator start, std::vector<double>::const_iterator end, double* result)
{
*result = std::accumulate(start, end, 0.0);
}
This consistently gives me better performance for two threads on my 32-bit netbook.
Upvotes: 2
Reputation: 471209
Why are 2 threads 6x slower than 1 thread?
You are getting hit by a bad case of false sharing.
After getting rid of the false-sharing, why is 2 threads not faster than 1 thread?
You are bottlenecked by your memory bandwidth.
False Sharing:
The problem here is that each thread is accessing the result
variable at adjacent memory locations. It's likely that they fall on the same cacheline so each time a thread accesses it, it will bounce the cacheline between the cores.
Each thread is running this loop:
for(uint32_t i = 0; i < length; i ++) {
*result += (*datavec).at(start + i);
}
And you can see that the result
variable is being accessed very often (each iteration). So each iteration, the threads are fighting for the same cacheline that's holding both values of result
.
Normally, the compiler should put *result
into a register thereby removing the constant access to that memory location. But since you never turned on optimizations, it's very likely the compiler is indeed still accessing the memory location and thus incurring false-sharing penalties at every iteration of the loop.
Memory Bandwidth:
Once you have eliminated the false sharing and got rid of the 6x slowdown, the reason why you're not getting improvement is because you've maxed out your memory bandwidth.
Sure your processor may be 4 cores, but they all share the same memory bandwidth. Your particular task of summing up an array does very little (computational) work for each memory access. A single thread is already enough to max out your memory bandwidth. Therefore going to more threads is not likely to get you much improvement.
In short, no you won't be able to make summing an array significantly faster by throwing more threads at it.
Upvotes: 20
Reputation: 13960
More threads doesn't mean faster! There is an overhead in creating and context-switching threads, even the hardware in which this code run is influencing the results. For such a trivial work like this it's better probably a single thread.
Upvotes: 1
Reputation: 274
This is probably because the cost of launching and waiting for two threads is a lot more than computing the result in a single loop. Your data size is 128MB, which is not alot for modern processors to process in a single loop.
Upvotes: 0