Hang Qi
Hang Qi

Reputation: 21

Merge Sort Time in C

This is my virtual machine:

CPU: 4 cores
RAM: 4096 MB
Operating System: Ubuntu 18.04 (64-bit)

Problem1: Why there is a threshold 4193790?

I write a merge sort in C:

#include<stdio.h>
#include<stdlib.h>
#include<omp.h>
#include<sys/time.h>
#include"helper.c"
#define N 4193789

void merge(double* li,int left1,int right1,int left2,int right2,int size){
    double *li_tmp;
    li_tmp = (double *)malloc(sizeof(double)*size);
    int i = left1;
    int j = left2;
    int k = left1;

    while(i<=right1 && j<=right2){ 
        if(li[i] < li[j]){
            li_tmp[k] = li[i++]; 
        }     
        else{
            li_tmp[k] = li[j++];
        } 
        k++; 
    }
    if(i>right1){
        while(j<=right2){
            li_tmp[k++] = li[j++];
        }
    }
    else if(j>right2){
        while(i<=right1){
            li_tmp[k++] = li[i++];
        }
    }

    for(i=left1; i<right2+1; i++){
        li[i] = li_tmp[i];
    }
    free(li_tmp); 
}

void merge_sort(double* li,int left,int right,int size){
    if (left<right){        
        int mid = (left + right)/2;
        merge_sort(li,left,mid,size);
        merge_sort(li,mid+1,right,size);
        merge(li,left,mid,mid+1,right,size);
    }
}

int main(int argc, char *argv[])
{
    // Just call merge_sort(), check the correctness and print the time consumption.
    double *data;
    data = (double *)malloc(sizeof(double)*N);
    srand(1234567);
    gen_rand(data,N);

    struct timeval start, end;
    gettimeofday(&start, NULL);

    merge_sort(data, 0, N-1, N);

    gettimeofday(&end, NULL);
    double delta = ((end.tv_sec  - start.tv_sec) * 1000000u + end.tv_usec - start.tv_usec) / 1.e6;

    bool correct = true;
    for(int i=0;i<N-1;i++){
        if (data[i]>data[i+1]){
            correct = false;
        }
    }
    if (correct){
        printf("Correct!\n");
    }else{
        printf("Not correct!\n");
    }

    printf("time spent=%12.10f\n",delta);
}

This is my helper.c, just generate a random double array in double *a.

#include<stdio.h>
#include<stdlib.h>
#include<stdbool.h>

void gen_rand(double *a, int num){
    for (int i=0;i<num;i++){
        a[i] = 1.0 * rand() / RAND_MAX * num;
    }
}

I found a very weird scenario:
I know that merge sort is O(nlogn), so when I experiment with different array lengths, I find that the time consumption varies a lot in one area, and it doesn't fit O(nlogn).After many attempts, I found a threshold.
When I define N as 4193789, the time is 1s, but when I change N to 4193790, the time will increase to 34s!
I wonder why there is such a threshold.

vagrant@hang2:~/data/$ gcc -fopenmp merge_sort_main.c 
vagrant@hang2:~/data/$ ./a.out 
Correct!
time spent=1.1103340000
vagrant@hang2:~/data/$ gcc -fopenmp merge_sort_main.c 
vagrant@hang2:~/data/$ ./a.out 
Correct!
time spent=34.5053590000

Problem2: Why the omp method get slower when a big array (more than 4193790)?

Another problem with omp: This is my omp main :

#pragma omp parallel num_threads(4)
    {
    #pragma omp single 
        merge_sort_omp(data, 0, N-1, N);
    }

And merge_sort_omp():

void merge_sort_omp(double* li,int left,int right,int size){
    if (left<right){
        if (right-left>10000){        
            int mid = (left + right)/2;
            #pragma omp task firstprivate (li, left, mid)
            merge_sort_omp(li,left,mid,size);
            #pragma omp task firstprivate (li, mid, right)
            merge_sort_omp(li,mid+1,right,size);
            #pragma omp taskwait
            merge(li,left,mid,mid+1,right,size);
        }else{
            int mid = (left + right)/2;
            merge_sort_omp(li,left,mid,size);
            merge_sort_omp(li,mid+1,right,size);
            merge(li,left,mid,mid+1,right,size);
        }
    }
}

I tried N=4000000 and N=4193790 as follows:

vagrant@hang2:~/data$ gcc -fopenmp merge_sort_main.c 
vagrant@hang2:~/data$ ./a.out 
Correct!
time spent=1.1358180000
vagrant@hang2:~/data$ gcc -fopenmp merge_sort_omp_main.c 
vagrant@hang2:~/data$ ./a.out 
Correct!
time spent=0.4998150000
vagrant@hang2:~/data$ gcc -fopenmp merge_sort_main.c 
vagrant@hang2:~/data$ ./a.out 
Correct!
time spent=34.3504340000
vagrant@hang2:~/data$ gcc -fopenmp merge_sort_omp_main.c 
vagrant@hang2:~/data$ ./a.out 
Correct!
time spent=111.9368700000

I want to know why the parallel code is twice as fast as the serial code at N= 4000000, but the serial code is slower at N=4193790. Almost three times slower. I want to know why the omp get slower?

Upvotes: 1

Views: 163

Answers (1)

selbie
selbie

Reputation: 104539

Aside from the calls to build with -O2 or -O3 in the comments, the most expensive thing your code is doing is calling malloc and free to build a temporary array on every invocation of merge_sort. Memory allocations being done on each iteration of a high-performance loop can slow things down greatly.

The easy fix is just do a single allocation of that temp buffer exactly once - and to make it big enough for all scenarios.

Instead of this:

void merge(double* li,int left1,int right1,int left2,int right2,int size){
    double *li_tmp;
    li_tmp = (double *)malloc(sizeof(double)*size);

This:

double* li_tmp = NULL;  // li_tmp is now global

void merge(double* li,int left1,int right1,int left2,int right2,int size){

    if (li_tmp == NULL) {
        li_tmp = (double *)malloc(sizeof(double)*N); // allocate for size N just once
    }

Then remove the free statement at the bottom of the merge function.

Instead of this:

    for(i=left1; i<right2+1; i++){
        li[i] = li_tmp[i];
    }
    free(li_tmp); 
}

Just this:

    for(i=left1; i<right2+1; i++){
        li[i] = li_tmp[i];
    }
    
}

Then free li_tmp elsewhere after the merge-sort has run to completion.

As to why different sizes for N cause different perf changes, I don't think that's worth getting into without these optimizations and compiler switches applied. The most likely hypothesis is that the array sizes caused by different sizes of N, trigger more "paging" between cache memory and main RAM. Or these large block allocations stress the memory manager in different ways.

Upvotes: 3

Related Questions