Reputation: 347
I am running Ubuntu on a machine with 32 CPUs (1 socket, 16 cores per socket, 2 threads per core).
I have a std::vector containing ~100-1000 objects and I am trying to parallelize a for loop which reads data from each object in the vector and writes to a file to log the state of each object. There is one file for each object. I've played around with omp_set_num_threads(8) and discovered that there is a sweet spot of around 8 threads. If I increase or decrease the number of threads, the loop runs slower. Given that I have 32 available CPUs, I am not sure why increasing the thread count above 8 makes it slower. I know many similar questions have been asked previously, but I cannot seem to find a solution to my particular issue.
#include <algorithm>
#include <experimental/filesystem>
#include <fstream>   // std::ofstream
#include <iomanip>   // std::setw, std::setprecision
#include <omp.h>
namespace fs = std::experimental::filesystem;

// Member function of the class that owns vObject and logstate (class definition not shown).
void log() {
    omp_set_num_threads(8);
    // Log all object states
    if (this->logstate) {
        // Each iteration opens, appends to, and closes one file per object.
        #pragma omp parallel for
        for (auto i = vObject.begin(); i < vObject.end(); ++i) {
            fs::path filename = (*i)->get_filename();
            std::ofstream OutputFile;
            OutputFile.open(filename, std::ios::app);
            OutputFile << std::setw(30) << (*i)->get_EPOCH() << std::setw(20) << std::scientific << std::setprecision(5) << (*i)->get_state() << std::endl;
            OutputFile.close();
        }
    }
}
Any thoughts or suggestions would be appreciated.
Upvotes: 0
Views: 1783
Reputation: 124997
"I am running Ubuntu on a machine with 32 CPUs (1 socket, 16 cores per socket, 2 threads per core)."
A thread is not a CPU. A core isn't even really a CPU; cores can each execute code independently, but they all share a single memory bus and various other resources.
So, you've got 32 threads running on 16 cores all sharing a single bus. At some point there's going to be contention for something, and that means that a lot of those threads have to sit around and wait. More threads -> more contention -> more waiting.
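One way to see this effect on a particular machine is to time the same workload at several thread counts and compare. The following is only a minimal sketch: `Timing`, `sweep_thread_counts`, `fastest`, and the `workload` callback are hypothetical names, not part of the question's code. The callback is assumed to set the thread count (for example with `omp_set_num_threads`) and then run the logging loop.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// One measurement: how long the workload took at a given thread count.
struct Timing {
    int thread_count;
    double seconds;
};

// Calls workload(thread_count) once per candidate and records wall-clock time.
// `workload` is a hypothetical callback supplied by the caller.
std::vector<Timing> sweep_thread_counts(const std::vector<int>& candidates,
                                        const std::function<void(int)>& workload) {
    std::vector<Timing> timings;
    timings.reserve(candidates.size());
    for (int thread_count : candidates) {
        assert(thread_count > 0);
        const auto start = std::chrono::steady_clock::now();
        workload(thread_count);
        const auto stop = std::chrono::steady_clock::now();
        timings.push_back(
            {thread_count, std::chrono::duration<double>(stop - start).count()});
    }
    return timings;
}

// Returns the fastest measurement; `timings` must not be empty.
Timing fastest(const std::vector<Timing>& timings) {
    assert(!timings.empty());
    Timing best = timings.front();
    for (const Timing& t : timings) {
        if (t.seconds < best.seconds) {
            best = t;
        }
    }
    return best;
}
```

A sweep over something like {1, 2, 4, 8, 16, 32} would show where the curve turns. A single run per thread count is noisy, especially with file I/O and caching involved, so repeating each measurement a few times is probably worthwhile.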
Now, we can all make educated guesses about where the resource contention might be: the file system, the memory bus, or something else? But we don't know much about your system and what else might be going on, so that's probably pointless. Profiling your code running on more threads to see where it's doing a lot of waiting could suggest some answers. Just keep in mind that those answers are likely to be specific to your current situation; if you change the computational work that needs to be done for each object, or change the way the output is collected, etc., that'll likely affect where the sweet spot is.
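Short of a full profiler, a rough first step might be to split the loop into a formatting phase and a file-writing phase and time each one separately. The sketch below assumes a simplified stand-in for the question's objects: `LogRecord`, `PhaseTimes`, and `timed_log` are hypothetical names, and the real `get_EPOCH()` and `get_state()` values are replaced by plain doubles. It uses the `num_threads` clause rather than `omp_set_num_threads`, so it also compiles (and runs serially) when OpenMP is not enabled.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for one object's log data.
struct LogRecord {
    std::string filename;
    double epoch;
    double state;
};

// Wall-clock seconds spent in each phase.
struct PhaseTimes {
    double format_seconds;
    double write_seconds;
};

PhaseTimes timed_log(const std::vector<LogRecord>& records, int thread_count) {
    assert(thread_count > 0);
    using clock = std::chrono::steady_clock;
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(records.size());
    std::vector<std::string> lines(records.size());

    // Phase 1: format each line in memory (CPU and memory traffic only).
    const auto t0 = clock::now();
    #pragma omp parallel for num_threads(thread_count)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        std::ostringstream out;
        out << std::setw(30) << records[i].epoch
            << std::setw(20) << std::scientific << std::setprecision(5)
            << records[i].state << '\n';
        lines[i] = out.str();
    }

    // Phase 2: open, append, and close one file per record (file system work).
    const auto t1 = clock::now();
    #pragma omp parallel for num_threads(thread_count)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        std::ofstream file(records[i].filename, std::ios::app);
        file << lines[i];
    }
    const auto t2 = clock::now();

    return {std::chrono::duration<double>(t1 - t0).count(),
            std::chrono::duration<double>(t2 - t1).count()};
}
```

If the write phase dominates and stops improving (or gets worse) beyond some thread count, that would point toward the file system rather than the computation. This is only a coarse indication, and the split may not match how the real objects produce their state.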
Upvotes: 3