Reputation: 23
I'm trying to add up the contents of a vector (v) using threads, with each thread writing its partial result into another vector (localSum), as the following code shows:
void threadsum(int threadID, int numThreads, const vector<double>& v, vector<double>& localSum)
{
    size_t start = threadID * v.size() / numThreads;
    size_t stop = (threadID + 1) * v.size() / numThreads;
    localSum[threadID] = 0.0;
    for (size_t i = start; i < stop; i++) {
        localSum[threadID] += v[i];
    }
}
Right now I'm running into performance issues caused by false cache sharing, because every thread writes to a different element of the same cache line. The input vector v and the per-thread result vector localSum are declared as follows:
// create the input vector v and put some values in it
vector<double> v(N);
for (int i = 0; i < N; i++)
    v[i] = i;

// this vector will contain the partial sum for each thread
vector<double> localSum(numThreads, 0.0);
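For reference, the threads are launched and joined roughly like this (a minimal sketch, since the driver code isn't shown here; the use of std::thread and the exact values of N and numThreads are assumptions):

#include <functional>
#include <numeric>
#include <thread>
#include <vector>
using namespace std;

void threadsum(int threadID, int numThreads, const vector<double>& v, vector<double>& localSum); // as above

int main()
{
    const int N = 1000000;     // assumed value
    const int numThreads = 4;  // assumed value

    vector<double> v(N);
    for (int i = 0; i < N; i++)
        v[i] = i;

    vector<double> localSum(numThreads, 0.0);

    // one thread per slice; cref/ref avoid copying the vectors into each thread
    vector<thread> threads;
    for (int t = 0; t < numThreads; t++)
        threads.emplace_back(threadsum, t, numThreads, cref(v), ref(localSum));
    for (auto& th : threads)
        th.join();

    // combine the per-thread partial sums on the main thread
    double total = accumulate(localSum.begin(), localSum.end(), 0.0);
    (void)total;
}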
Now, how can I avoid this problem? One idea I had is to use a mutex to restrict access to localSum. Another idea was to pad or misalign the elements of the vector so they don't end up on the same cache line. Any idea to solve this problem would be much appreciated!
Upvotes: 2
Views: 333
Reputation: 32732
Accumulate the sum for each thread in a local variable, then save that out into localSum
at the end of your loop.
void threadsum(int threadID, int numThreads, const vector<double>& v, vector<double>& localSum)
{
    size_t start = threadID * v.size() / numThreads;
    size_t stop = (threadID + 1) * v.size() / numThreads;
    double sum = 0.0;            // thread-private accumulator
    for (size_t i = start; i < stop; i++) {
        sum += v[i];
    }
    localSum[threadID] = sum;    // single write to the shared vector
}
You'll still have that issue with cache line sharing, but each thread will only do one write instead of one per element. Also, with the loop in this form, the optimizer should be able to do a better job.
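If that remaining write sharing ever matters, one way to remove it completely (in the spirit of the "misalign the elements" idea from the question) is to give each thread's slot its own cache line by padding the element type. This is only a sketch under assumptions: 64 bytes is a typical but not universal line size, the PaddedSum name is made up, and std::vector of an over-aligned type needs C++17:

#include <vector>
using namespace std;

// pad each per-thread slot to a full (assumed 64-byte) cache line
struct alignas(64) PaddedSum {
    double value = 0.0;
};

void threadsum(int threadID, int numThreads, const vector<double>& v, vector<PaddedSum>& localSum)
{
    size_t start = threadID * v.size() / numThreads;
    size_t stop = (threadID + 1) * v.size() / numThreads;
    double sum = 0.0;
    for (size_t i = start; i < stop; i++)
        sum += v[i];
    localSum[threadID].value = sum;  // each slot now lives on its own cache line
}

Declared as vector<PaddedSum> localSum(numThreads);, each element occupies a whole cache line, so the per-thread writes can no longer invalidate each other's lines.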
Upvotes: 1