sourav mahmood

Reputation: 73

User time increase in multi-CPU job

I am running the code listed at the end of this question.

When I run it with 1 child process (via /usr/bin/time ./job 1), I get the following timing information:

5.489u 0.090s 0:05.58 99.8% (1 job running)

When I run it with 6 child processes, I get:

74.731u 0.692s 0:12.59 599.0% (6 jobs running in parallel)

The machine I am running the experiment on has 6 cores and 198 GB of RAM, and nothing else is running on it.

I was expecting the reported user time to be about 6 times the single-job figure in the 6-job case, but it is much more than that (13.6 times). My question is: where does this increase in user time come from? Is it because, with 6 jobs running in parallel, the cores jump between memory locations more frequently? Or is there something else I am missing?

Thanks

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define MAX_SIZE 7000000
#define LOOP_COUNTER 100

/* Macro trick so "simple_struct" reads like a typedef. */
#define simple_struct struct _simple_struct
simple_struct {
    int n;
    simple_struct *next;
};

#define ALLOCATION_SPLIT 5
#define CHAIN_LENGTH 1
void do_function3(void)
{
    int i = 0, j = 0, k = 0, l = 0;
    simple_struct **big_array = NULL;
    simple_struct *temp = NULL;

    /* One pointer slot per element, zero-initialized. */
    big_array = calloc(MAX_SIZE + 1, sizeof(simple_struct *));

    /* Allocate nodes in an interleaved order so that neighbouring
       array entries do not end up in contiguous heap memory. */
    for (k = 0; k < ALLOCATION_SPLIT; k++) {
        for (i = k; i < MAX_SIZE; i += ALLOCATION_SPLIT) {
            big_array[i] = calloc(1, sizeof(simple_struct));
            if (CHAIN_LENGTH - 1) {
                for (l = 1; l < CHAIN_LENGTH; l++) {
                    temp = calloc(1, sizeof(simple_struct));
                    temp->next = big_array[i];
                    big_array[i] = temp;
                }
            }
        }
    }

    /* Walk the whole array LOOP_COUNTER times, writing to every node. */
    for (j = 0; j < LOOP_COUNTER; j++) {
        for (i = 0; i < MAX_SIZE; i++) {
            if (big_array[i] == NULL) {
                big_array[i] = calloc(1, sizeof(simple_struct));
            }
            big_array[i]->n = i * 13;
            temp = big_array[i]->next;
            while (temp) {
                temp->n = i * 13;
                temp = temp->next;
            }
        }
    }
}

int main(int argc, char **argv)
{
    int i, no_of_processes = 0;
    pid_t pid, wpid;
    int child_done = 0;
    int status;
    if (argc != 2) {
        printf("usage: this_binary number_of_processes\n");
        return 0;
    }

    no_of_processes = atoi(argv[1]);

    for (i = 0; i < no_of_processes; i++) {
        pid = fork();

        switch (pid) {
            case -1:
                printf("error forking\n");
                exit(-1);
            case 0:
                /* Child: do the work and exit. */
                do_function3();
                return 0;
            default:
                printf("\nchild %d launched with pid %d\n", i, pid);
                break;
        }
    }

    /* Reap every child before exiting. */
    while (child_done != no_of_processes) {
        wpid = wait(&status);
        child_done++;
        printf("\nchild done with pid %d\n", wpid);
    }

    return 0;
}

Upvotes: 1

Views: 58

Answers (1)

o9000

Reputation: 1666

Firstly, your benchmark is a bit unusual. Normally, when benchmarking concurrent applications, one would compare two implementations:

  • A single thread version solving a problem of size S;
  • A multi-thread (or multi-process) version with N workers cooperatively solving the same problem of size S; in your case, that would mean each process solving a sub-problem of size S/N.

Then you divide the execution times to obtain the speedup.
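A minimal sketch of that setup, under the assumption of a simplified stand-in workload (WORK_TOTAL, do_slice and the binary name bench are illustrative, not taken from your program): the total problem size stays fixed, and each of the N processes works on its own slice of size S/N.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define WORK_TOTAL 7000000   /* fixed total problem size S */

/* Stand-in workload: write to every index in this process's slice. */
static void do_slice(long start, long end)
{
    static volatile long sink;
    int pass;
    long i;

    for (pass = 0; pass < 100; pass++)
        for (i = start; i < end; i++)
            sink = i * 13;   /* volatile write keeps the loop alive */
}

int main(int argc, char **argv)
{
    int i, n;
    long slice;

    if (argc != 2) {
        fprintf(stderr, "usage: %s number_of_processes\n", argv[0]);
        return 1;
    }
    n = atoi(argv[1]);
    if (n < 1)
        return 1;
    slice = WORK_TOTAL / n;  /* each process gets S/N (remainder ignored) */

    for (i = 0; i < n; i++) {
        if (fork() == 0) {               /* child: work on own slice */
            do_slice(i * slice, (i + 1) * slice);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)               /* parent: reap all children */
        ;
    return 0;
}

Timing /usr/bin/time ./bench 1 against /usr/bin/time ./bench 6 and dividing the elapsed times then gives the speedup directly.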

If your speedup is:

  • Around 1: the parallel implementation has performance similar to the single-thread implementation;
  • Higher than 1 (usually between 1 and N): parallelizing the application increases performance;
  • Lower than 1: parallelizing the application hurts performance.
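Your measurements can still be read through this lens once you account for the fact that your 6-process run does 6 times the work of the 1-process run: per process, user time grows from 5.489 s to 74.731 s / 6 ≈ 12.46 s, i.e. each process runs roughly 2.3× slower when all six run together, which matches the 13.6 / 6 ≈ 2.27 ratio you observed.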

The effect on performance depends on a variety of factors:

  • How well your algorithm can be parallelized. See Amdahl's law. Does not apply here.
  • Overhead in inter-thread communication. Does not apply here.
  • Overhead in inter-thread synchronization. Does not apply here.
  • Contention for CPU resources. Should not apply here (since the number of processes equals the number of cores). However, HyperThreading might hurt.
  • Contention for memory caches. Since the processes do not share any memory, each one drags its own working set into the shared caches, so this will decrease performance.
  • Contention for accesses to main memory. This will decrease performance.

You can measure the last two with a profiler: look for cache misses and stalled instructions.
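On Linux, for instance, perf can report these counters; the generic event names below exist on most CPUs (the stalled-cycles events are not supported everywhere):

    perf stat -e cycles,instructions,cache-references,cache-misses ./job 6

Comparing the counts per unit of work between the 1-process and the 6-process run shows how much of the extra user time is spent stalled on memory.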

Upvotes: 2
