Reputation: 53027
I'm on a notebook with an Apple M1, which has 4 cores. I have a problem that can be solved in parallel by dynamically spawning millions of threads. Since pthread_create has a huge overhead, I avoid it by instead keeping 3 worker threads alive in the background. These threads wait for tasks to arrive:
void *worker(void *arg) {
    u64 tid = (u64)arg;
    // Loop until the main thread signals that all work is complete
    while (!atomic_load(&done)) {
        // If there is a new task for this worker...
        if (atomic_load(&workers[tid].stat) == WORKER_NEW_TASK) {
            // ...execute it
            execute_task(&workers[tid]);
        }
    }
    return 0;
}
These threads are spawned with pthread_create once:
pthread_create(&workers[tid].thread, NULL, &worker, (void*)tid);
Any time I need a new task done, instead of calling pthread_create again, I just select an idle worker and hand it the task, filling in the task before flipping the flag so the worker never sees a stale task:
workers[tid].task = ...;
workers[tid].stat = WORKER_NEW_TASK;
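For reference, the shared state these snippets rely on looks roughly like this (Task stands in for my real task type):

#include <pthread.h>
#include <stdatomic.h>

#define MAX_THREADS 3  // 3 background workers on my 4-core M1

typedef unsigned long long u64;

enum { WORKER_IDLE, WORKER_NEW_TASK };

typedef struct {
    pthread_t thread;
    atomic_int stat;  // WORKER_IDLE or WORKER_NEW_TASK
    Task task;        // payload consumed by execute_task
} Worker;

Worker workers[MAX_THREADS];
atomic_bool done;     // set by the main thread when all work is finished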
The problem is: for some reason, leaving these 3 threads running in the background makes my main thread 25% slower. Since my CPU has 4 cores, I expected the 3 extra threads not to affect the main thread at all.
Why are the background threads slowing down the main thread? Am I doing anything wrong? Is the while (!atomic_load(&done)) loop consuming a lot of CPU power?
Upvotes: 0
Views: 261
Reputation: 53027
The issue is that I had a thread like this:
typedef struct {
    atomic_int has_work;
} Worker;

Worker workers[MAX_THREADS];

void *worker(void *arg) {
    u64 tid = (u64)arg;
    while (1) {
        if (atomic_load(&workers[tid].has_work)) {
            // do stuff
        }
    }
}
(...)
Notice how I used an atomic flag, has_work, to "wake up" the worker. As @WhozCraig pointed out in his comment, it may not be a good idea to spin-loop on atomic flags. If I understand correctly, the spinning itself is computation-intensive, which overwhelms the CPU. His suggestion was to use a mutex with condition variables, as described in "Programming with POSIX Threads" by Butenhof, section 3.3. This Stack Overflow answer has a snippet showing the common usage of that pattern. The resulting code should look like this:
typedef struct {
    int has_work;
    pthread_mutex_t has_work_mutex;
    pthread_cond_t has_work_signal;
} Worker;

Worker workers[MAX_THREADS];

void *worker(void *arg) {
    u64 tid = (u64)arg;
    while (1) {
        pthread_mutex_lock(&workers[tid].has_work_mutex);
        // Sleep until another thread signals that has_work may be true
        while (!workers[tid].has_work) {
            pthread_cond_wait(&workers[tid].has_work_signal, &workers[tid].has_work_mutex);
        }
        // do stuff
        workers[tid].has_work = 0; // mark the task consumed before sleeping again
        pthread_mutex_unlock(&workers[tid].has_work_mutex);
    }
    return 0;
}
Notice how has_work was changed from an atomic to a plain int, with a mutex guarding it. A condition variable is then associated with that mutex. This allows us to use pthread_cond_wait to put the thread to sleep until another thread signals that has_work might be true. This has the same effect as the version using atomics, but is less CPU-hungry and should perform better.
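For completeness, the signaling side (which I didn't show above) has to take the same mutex when handing over work. A minimal sketch, where send_task is a name I made up and the task payload is omitted:

// Hand work to worker `tid` and wake it up
void send_task(u64 tid) {
    pthread_mutex_lock(&workers[tid].has_work_mutex);
    workers[tid].has_work = 1;                          // publish the work while holding the lock
    pthread_cond_signal(&workers[tid].has_work_signal); // wake the sleeping worker
    pthread_mutex_unlock(&workers[tid].has_work_mutex);
}

Note that pthread_cond_wait can wake up spuriously, which is why the worker re-checks has_work in a while loop instead of a plain if.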
Upvotes: 2