Why is the multithreaded version of this program slower?

Question

I am trying to learn pthreads and I have been experimenting with a program that tries to detect the changes on an array. Function array_modifier() picks a random element and toggles it's value (1 to 0 and vice versa) and then sleeps for some time (big enough so race conditions do not appear, I know this is bad practice). change_detector() scans the array and when an element doesn't match it's prior value and it is equal to 1, the change is detected and diff array is updated with the detection delay.

When there is one change_detector() thread (NTHREADS==1) it has to scan the whole array. When there are more threads each is assigned a portion of the array. Each detector thread will only catch the modifications in its part of the array, so you need to sum the catch times of all 4 threads to get the total time to catch all changes.

Here is the code:

#include 
#include 
#include 
#include 
#include 
#include 

#define TIME_INTERVAL 100
#define CHANGES 5000

#define UNUSED(x) ((void) x)

typedef struct {
    unsigned int tid;
} parm;

static volatile unsigned int* my_array;
static unsigned int* old_value;
static struct timeval* time_array;
static unsigned int N;

static unsigned long int diff[NTHREADS] = {0};

void* array_modifier(void* args);
void* change_detector(void* arg);

int main(int argc, char** argv) {
    if (argc < 2) {
        exit(1);
    }

    N = (unsigned int)strtoul(argv[1], NULL, 0);

    my_array = calloc(N, sizeof(int));
    time_array = malloc(N * sizeof(struct timeval));
    old_value = calloc(N, sizeof(int));

    parm* p = malloc(NTHREADS * sizeof(parm));
    pthread_t generator_thread;
    pthread_t* detector_thread = malloc(NTHREADS * sizeof(pthread_t));

    for (unsigned int i = 0; i < NTHREADS; i++) {
        p[i].tid = i;
        pthread_create(&detector_thread[i], NULL, change_detector, (void*) &p[i]);
    }

    pthread_create(&generator_thread, NULL, array_modifier, NULL);

    pthread_join(generator_thread, NULL);

    usleep(500);

    for (unsigned int i = 0; i < NTHREADS; i++) {
        pthread_cancel(detector_thread[i]);
    }

    for (unsigned int i = 0; i < NTHREADS; i++) fprintf(stderr, "%lu ", diff[i]);
    fprintf(stderr, "
");
    _exit(0);
}


void* array_modifier(void* arg) {
    UNUSED(arg);
    srand(time(NULL));

    unsigned int changing_signals = CHANGES;

    while (changing_signals--) {
        usleep(TIME_INTERVAL);
        const unsigned int r = rand() % N;

        gettimeofday(&time_array[r], NULL);
        my_array[r] ^= 1;
    }

    pthread_exit(NULL);
}

void* change_detector(void* arg) {
    const parm* p = (parm*) arg;
    const unsigned int tid = p->tid;
    const unsigned int start = tid * (N / NTHREADS) +
                               (tid < N % NTHREADS ? tid : N % NTHREADS);
    const unsigned int end = start + (N / NTHREADS) +
                             (tid < N % NTHREADS);
    unsigned int r = start;

    while (1) {
        unsigned int tmp;
        while ((tmp = my_array[r]) == old_value[r]) {
            r = (r < end - 1) ? r + 1 : start;
        }

        old_value[r] = tmp;
        if (tmp) {
            struct timeval tv;
            gettimeofday(&tv, NULL);
            // detection time in usec
            diff[tid] += (tv.tv_sec - time_array[r].tv_sec) * 1000000 + (tv.tv_usec - time_array[r].tv_usec);
        }
    }
}

when I compile & run like this:

gcc -Wall -Wextra -O3 -DNTHREADS=1 file.c -pthread && ./a.out 100

I get:

but when I compile & run like this:

gcc -Wall -Wextra -O3 -DNTHREADS=4 file.c -pthread && ./a.out 100

I get:

152 190 164 242

(this sums up to 748).

So, the delay for the multithreaded program is larger.

My cpu has 6 cores.

doron · Accepted Answer

Short Answer You are sharing memory between thread and sharing memory between threads is slow.

Long Answer Your program is using a number of thread to write to my_array and another thread to read from my_array. Effectively my_array is shared by a number of threads.

Now lets assume you are benchmarking on a multicore machine, you probably are hoping that the OS will assign different cores to each thread.

Bear in mind that on modern processors writing to RAM is really expensive (hundreds of CPU cycles). To improve performance CPUs have multi-level caches. The fastest Cache is the small L1 cache. A core can write to its L1 cache in the order of 2-3 cycles. The L2 cache may take on the order of 20 - 30 cycles.

Now in lots of CPU architectures each core has its own L1 cache but the L2 cache is shared. This means any data that is shared between thread (cores) has to go through the L2 cache which is much slower than the L1 cache. This means that shared memory access tends to be quite slow.

Bottom line is that if you want your multithreaded programs to perform well you need to ensure that threads do not share memory. Sharing memory is slow.

Aside Never rely on volatile to do the correct thing when sharing memory between thread, either use your library atomic operations or use mutexes. This is because some CPUs allow out of order reads and writes that may do strange things if you do not know what you are doing.

Why is the multithreaded version of this program slower?

Answers (2)

Related Questions