hemmat

Reputation: 15

Array traversal: parallel performance is slower than non-parallel

In my program I want to determine how many numbers have 9 digits, how many have 8 digits, and so on, using this loop:

for (int i = 0; i < 60000000; i++)
{
    if (a[i] >= 1000000000) { p[10] += 1; }
    else if (a[i] >= 100000000) { p[9] += 1; }
    else if (a[i] >= 10000000) { p[8] += 1; }
    else if (a[i] >= 1000000) { p[7] += 1; }
    else if (a[i] >= 100000) { p[6] += 1; }
    else if (a[i] >= 10000) { p[5] += 1; }
    else if (a[i] >= 1000) { p[4] += 1; }
    else if (a[i] >= 100) { p[3] += 1; }
    else if (a[i] >= 10) { p[2] += 1; }
    else { p[1] += 1; }
}

I parallelized the loop like this:

void partition(int f, int l, int[] p)
{
    Parallel.Invoke(() => calc(f, l / 2, p), () => calc(l / 2, l, p));
}

void calc(int f, int l, int[] p)
{
    for (int i = f; i < l; i++)
    {
        if (a[i] >= 1000000000) { p[10] += 1; }
        else if (a[i] >= 100000000) { p[9] += 1; }
        else if (a[i] >= 10000000) { p[8] += 1; }
        else if (a[i] >= 1000000) { p[7] += 1; }
        else if (a[i] >= 100000) { p[6] += 1; }
        else if (a[i] >= 10000) { p[5] += 1; }
        else if (a[i] >= 1000) { p[4] += 1; }
        else if (a[i] >= 100) { p[3] += 1; }
        else if (a[i] >= 10) { p[2] += 1; }
        else { p[1] += 1; }
    }
}

private void button1_Click(object sender, EventArgs e)
{
    Stopwatch w = new Stopwatch();
    w.Restart();
    int f = 0;
    int l = 60000000;
    Parallel.Invoke(() => calc(f, l / 2, p), () => calc(l / 2, l, p));
    w.Stop();
    label1.Text = w.Elapsed.ToString();
}

But the benchmark shows: Sequential: 0.3834, Parallel: 0.6864.

Why is the parallel code slower? Is there something wrong with my code? My CPU is an AMD Phenom II X4 955.

Upvotes: 0

Views: 411

Answers (2)

Der Kommissar

Reputation: 5953

It's all in the variables.

Take, for instance, your p object: you pass the same p array to both tasks. Now, I'm not sure whether Parallel.Invoke is capable of detecting this and therefore executes them serially (albeit with significant overhead), but I do know that if it is not detecting this, then you have two threads constantly trying to read and write the same values at the same time.

Now, I built a small, concrete example using your code as a base, and here's a copy of it. (Paste into any new console project, rename _Main to Main and run as you see fit.)

// Requires: using System; using System.Diagnostics; using System.Threading.Tasks;
static int[] a = new int[100000000];
static void calc(int f, int l, int[] p, int[] a)
{
    for (int i = f; i < l; i++)
    {
        if (a[i] >= 1000000000) { p[10] += 1; }
        else if (a[i] >= 100000000) { p[9] += 1; }
        else if (a[i] >= 10000000) { p[8] += 1; }
        else if (a[i] >= 1000000) { p[7] += 1; }
        else if (a[i] >= 100000) { p[6] += 1; }
        else if (a[i] >= 10000) { p[5] += 1; }
        else if (a[i] >= 1000) { p[4] += 1; }
        else if (a[i] >= 100) { p[3] += 1; }
        else if (a[i] >= 10) { p[2] += 1; }
        else { p[1] += 1; }
    }
}
public static void _Main(string[] args)
{
    for (int i = 0; i < a.Length; i++)
    {
        a[i] = i;
    }

    int f = 0;
    int l = a.Length;
    int[] p = new int[11];   // 11 slots, because calc writes indices 1..10
    int[] p1 = new int[11];
    int[] p2 = new int[11];
    int[] p3 = new int[11];
    int[] p4 = new int[11];

    int[] a1 = new int[l / 2];
    int[] a2 = new int[l / 2];

    int[] a11 = new int[l / 4];
    int[] a12 = new int[l / 4];
    int[] a13 = new int[l / 4];
    int[] a14 = new int[l / 4];

    for (int i = 0; i < a.Length; i++)
        if (i >= l / 2)
            a2[i - l / 2] = a[i];
        else
            a1[i] = a[i];

    for (int i = 0; i < a.Length; i++)
        if (i >= l / 4 * 3)
            a14[i - l / 4 * 3] = a[i];
        else if (i >= l / 4 * 2)
            a13[i - l / 4 * 2] = a[i];
        else if (i >= l / 4 * 1)
            a12[i - l / 4] = a[i];
        else
            a11[i] = a[i];   // first quarter goes into a11

    int rc = 5;
    for (int d = 0; d < rc; d++)
    {
        Stopwatch w = new Stopwatch();
        w.Start();
        // Attempt 1: two tasks, each with its own half of the data and its own counter array.
        Parallel.Invoke(() => calc(f, l / 2, p1, a1), () => calc(f, l / 2, p2, a2));
        w.Stop();
        Console.WriteLine("Attempt {0}/{1}: {2}", 1, d, w.ElapsedMilliseconds);
        w.Reset();
        w.Start();
        // Attempt 2: four tasks, each with its own quarter of the data and its own counter array.
        Parallel.Invoke(() => calc(f, l / 4, p1, a11), () => calc(f, l / 4, p2, a12), () => calc(f, l / 4, p3, a13), () => calc(f, l / 4, p4, a14));
        w.Stop();
        Console.WriteLine("Attempt {0}/{1}: {2}", 2, d, w.ElapsedMilliseconds);
        w.Reset();
        w.Start();
        // Attempt 3: the question's approach, two tasks sharing the same a and p arrays.
        Parallel.Invoke(() => calc(f, l / 2, p, a), () => calc(l / 2, l, p, a));
        w.Stop();
        Console.WriteLine("Attempt {0}/{1}: {2}", 3, d, w.ElapsedMilliseconds);
        w.Reset();
        w.Start();
        // Attempt 4: sequential baseline over the whole array.
        calc(f, l, p, a);
        w.Stop();
        Console.WriteLine("Attempt {0}/{1}: {2}", 4, d, w.ElapsedMilliseconds);
    }
}

I'm sure there are more optimizations I could make (converting the if chains to loops, for example). I also cannot guarantee the accuracy of the counts; I merely took your logic and applied proper benchmarking to it.

But when I ran this exact example on my PC, I had the following results:

  1. Attempt 1 took an average of 327.8 ms. This attempt splits the a and p variables into two separate arrays each.
  2. Attempt 2 took an average of 306 ms. This attempt splits the a and p variables into four separate arrays each.
  3. Attempt 3 took an average of 703 ms. This is exactly what you were doing (albeit with the arrays passed as parameters to the calc method).
  4. Attempt 4 took an average of 347.6 ms. This runs the calc method synchronously over the whole array.

Why such a big difference? Attempts 1 and 2 split the processed data into variables that do not require the threads to synchronize, whereas Attempt 3 forces the two threads to use the same variables, creating clashes and, as Ron Beyer said, context switches.

Basically, if you are going to write to the same data structure in parallel, you should localize the data each thread writes to and merge the changes at the end.
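
As an illustration (not from the original answer), here is a minimal sketch of that "localize then merge" idea applied to the two-way split above; it reuses the calc(f, l, p, a) method and the a array from the example, and the names p1, p2, and total are just illustrative:

// Each task gets its own counter array, so there are no shared writes in the hot loop.
int l = a.Length;
int[] p1 = new int[11];   // calc writes indices 1..10
int[] p2 = new int[11];

Parallel.Invoke(
    () => calc(0, l / 2, p1, a),    // first half counts into p1 only
    () => calc(l / 2, l, p2, a));   // second half counts into p2 only

// Merge the per-task counts once both tasks have finished.
int[] total = new int[11];
for (int digits = 1; digits <= 10; digits++)
    total[digits] = p1[digits] + p2[digits];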

Upvotes: 4

Viktor Peller

Reputation: 646

  • This code will not give you the right numbers, because it increments the same variables from multiple threads without synchronization. When different CPU cores work on the same variable, every core has its own cached copy, and a modification of that copy does not propagate to the other caches immediately. Because of that, the other cores keep working on the old value. For example, one core might have incremented p[1] from 0 to 1, but the other core still believes it is 0, so when it increments it, the value becomes 1 again. Later this 1 ends up in main memory instead of 2.
  • To answer your question: the problem is that you use the same block of memory from two threads, and that slows down the memory access. Data is normally served from the cache, but when one core writes to the memory area, the other cores sooner or later detect this and have to reload it from main memory, which is slow. (And the "sooner or later" matters to you: the update does not become visible immediately, so you need synchronization, which makes everything even slower when you do it right; a sketch of that synchronized approach follows this list.) Because of these re-fetches, the multithreaded version is slower.
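
For completeness, here is a minimal sketch (not part of the original answer) of the "synchronize it properly" option mentioned above, using Interlocked.Increment on the shared counter array; it assumes the question's a and p fields (with p sized for indices 1..10) and requires using System.Threading:

// Correct but typically slow: every increment is an atomic read-modify-write
// on the shared p array, so the cores still contend for the same cache lines.
void calcSynchronized(int f, int l, int[] p)
{
    for (int i = f; i < l; i++)
    {
        int digits =
            a[i] >= 1000000000 ? 10 :
            a[i] >= 100000000 ? 9 :
            a[i] >= 10000000 ? 8 :
            a[i] >= 1000000 ? 7 :
            a[i] >= 100000 ? 6 :
            a[i] >= 10000 ? 5 :
            a[i] >= 1000 ? 4 :
            a[i] >= 100 ? 3 :
            a[i] >= 10 ? 2 : 1;
        Interlocked.Increment(ref p[digits]);   // atomic, visible to all cores
    }
}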

When you make an algorithm multithreaded, try to split the work so that the threads do not use shared memory. As a kind of micro-optimization - which is BAD - you can also try to allocate each thread's data so that the blocks are not next to each other in memory; otherwise the cache problem mentioned before will still slow down the processing.
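
As an illustration of splitting the work so that threads do not share memory (a sketch, not part of the original answer), the TPL can manage a private counter array per worker for you via the Parallel.For overload with localInit and localFinally; the array name a follows the question, everything else here is illustrative:

// Requires: using System.Threading.Tasks;
int[] totals = new int[11];          // final result, indices 1..10
object mergeLock = new object();

Parallel.For(0, a.Length,
    () => new int[11],               // localInit: a private counter array per worker
    (i, state, local) =>
    {
        int value = a[i];
        int digits = 1;
        while (value >= 10) { value /= 10; digits++; }
        local[digits]++;             // no shared writes inside the hot loop
        return local;
    },
    local =>                         // localFinally: merge each worker's counts once
    {
        lock (mergeLock)
            for (int d = 1; d <= 10; d++)
                totals[d] += local[d];
    });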

Upvotes: 1
