hemmat

Reputation: 15

Array traversal: parallel performance is slower than non-parallel

In my program I want to determine how many numbers have 9 digits, how many have 8 digits, and so on, using this loop:

for (int i = 0; i < 60000000; i++)
{
    if (a[i] >= 1000000000) { p[10] += 1; }
    else if (a[i] >= 100000000) { p[9] += 1; }
    else if (a[i] >= 10000000) { p[8] += 1; }
    else if (a[i] >= 1000000) { p[7] += 1; }
    else if (a[i] >= 100000) { p[6] += 1; }
    else if (a[i] >= 10000) { p[5] += 1; }
    else if (a[i] >= 1000) { p[4] += 1; }
    else if (a[i] >= 100) { p[3] += 1; }
    else if (a[i] >= 10) { p[2] += 1; }
    else { p[1] += 1; }
}

I parallelized the loop like this:

void partition(int f, int l, int[] p)
{
    Parallel.Invoke(() => calc(f, l / 2, p), () => calc(l / 2, l, p));
}

void calc(int f, int l, int[] p)
{
    for (int i = f; i < l; i++)
    {
        if (a[i] >= 1000000000) { p[10] += 1; }
        else if (a[i] >= 100000000) { p[9] += 1; }
        else if (a[i] >= 10000000) { p[8] += 1; }
        else if (a[i] >= 1000000) { p[7] += 1; }
        else if (a[i] >= 100000) { p[6] += 1; }
        else if (a[i] >= 10000) { p[5] += 1; }
        else if (a[i] >= 1000) { p[4] += 1; }
        else if (a[i] >= 100) { p[3] += 1; }
        else if (a[i] >= 10) { p[2] += 1; }
        else { p[1] += 1; }
    }
}

private void button1_Click(object sender, EventArgs e)
{
    Stopwatch w = new Stopwatch();
    w.Restart();
    int f = 0;
    int l = 60000000;
    Parallel.Invoke(() => calc(f, l / 2, p), () => calc(l / 2, l, p));
    w.Stop();
    label1.Text = w.Elapsed.ToString();
}

But the benchmark shows: Sequential: 0.3834, Parallel: 0.6864.

Why is the parallel code slower? Is there something wrong with my code? My CPU is an AMD Phenom II X4 955.

Upvotes: 0

Views: 411

Answers (2)

Der Kommissar

Reputation: 5953

It's all in the variables.

Take, for instance, your p object: you pass the same p array to both tasks. Now, I'm not sure whether Parallel.Invoke is capable of detecting this and therefore executes them serially (albeit with significant overhead), but I do know that if it is not detecting this, then you have two threads constantly trying to read and write the same values at the same time.

Now, I built a small, concrete example using your code as a base, and here's a copy of it. (Paste into any new console project, rename _Main to Main and run as you see fit.)

// Requires: using System; using System.Diagnostics; using System.Threading.Tasks;
static int[] a = new int[100000000];
static void calc(int f, int l, int[] p, int[] a)
{
    for (int i = f; i < l; i++)
    {
        if (a[i] >= 1000000000) { p[10] += 1; }
        else if (a[i] >= 100000000) { p[9] += 1; }
        else if (a[i] >= 10000000) { p[8] += 1; }
        else if (a[i] >= 1000000) { p[7] += 1; }
        else if (a[i] >= 100000) { p[6] += 1; }
        else if (a[i] >= 10000) { p[5] += 1; }
        else if (a[i] >= 1000) { p[4] += 1; }
        else if (a[i] >= 100) { p[3] += 1; }
        else if (a[i] >= 10) { p[2] += 1; }
        else { p[1] += 1; }
    }
}
public static void _Main(string[] args)
{
    for (int i = 0; i < a.Length; i++)
    {
        a[i] = i;
    }

    int f = 0;
    int l = a.Length;
    int[] p = new int[11];   // 11 slots, because calc writes indices 1..10
    int[] p1 = new int[11];
    int[] p2 = new int[11];
    int[] p3 = new int[11];
    int[] p4 = new int[11];

    int[] a1 = new int[l / 2];
    int[] a2 = new int[l / 2];

    int[] a11 = new int[l / 4];
    int[] a12 = new int[l / 4];
    int[] a13 = new int[l / 4];
    int[] a14 = new int[l / 4];

    for (int i = 0; i < a.Length; i++)
        if (i >= l / 2)
            a2[i - l / 2] = a[i];
        else
            a1[i] = a[i];

    for (int i = 0; i < a.Length; i++)
        if (i >= l / 4 * 3)
            a14[i - l / 4 * 3] = a[i];
        else if (i >= l / 4 * 2)
            a13[i - l / 4 * 2] = a[i];
        else if (i >= l / 4 * 1)
            a12[i - l / 4] = a[i];
        else
            a11[i] = a[i];   // first quarter goes into a11

    int rc = 5;
    for (int d = 0; d < rc; d++)
    {
        Stopwatch w = new Stopwatch();
        w.Start();
        // Attempt 1: two tasks, each with its own half of the data and its own counter array.
        Parallel.Invoke(() => calc(f, l / 2, p1, a1), () => calc(f, l / 2, p2, a2));
        w.Stop();
        Console.WriteLine("Attempt {0}/{1}: {2}", 1, d, w.ElapsedMilliseconds);
        w.Reset();
        w.Start();
        // Attempt 2: four tasks, each with its own quarter of the data and its own counter array.
        Parallel.Invoke(() => calc(f, l / 4, p1, a11), () => calc(f, l / 4, p2, a12), () => calc(f, l / 4, p3, a13), () => calc(f, l / 4, p4, a14));
        w.Stop();
        Console.WriteLine("Attempt {0}/{1}: {2}", 2, d, w.ElapsedMilliseconds);
        w.Reset();
        w.Start();
        // Attempt 3: the question's approach, two tasks sharing the same a and p arrays.
        Parallel.Invoke(() => calc(f, l / 2, p, a), () => calc(l / 2, l, p, a));
        w.Stop();
        Console.WriteLine("Attempt {0}/{1}: {2}", 3, d, w.ElapsedMilliseconds);
        w.Reset();
        w.Start();
        // Attempt 4: sequential baseline over the whole array.
        calc(f, l, p, a);
        w.Stop();
        Console.WriteLine("Attempt {0}/{1}: {2}", 4, d, w.ElapsedMilliseconds);
    }
}

I'm sure there are more optimizations I could make (converting the if chains to loops, for example). I also cannot guarantee the accuracy of the counts; I merely took your logic and applied proper benchmarking to it.

But when I ran this exact example on my PC, I had the following results:

  1. Attempt 1 took an average of 327.8 ms. This attempt splits the a and p variables into two separate arrays each.
  2. Attempt 2 took an average of 306 ms. This attempt splits the a and p variables into four separate arrays each.
  3. Attempt 3 took an average of 703 ms. This is exactly what you were doing (albeit with the arrays passed as parameters to the calc method).
  4. Attempt 4 took an average of 347.6 ms. This runs the calc method synchronously over the whole array.

Why such a big difference? Attempts 1 and 2 split the processed data into variables that do not require the threads to synchronize, whereas Attempt 3 forces the two threads to use the same variables, creating clashes and, as Ron Beyer said, context switches.

Basically, if you are going to write to the same data structure in parallel, you should localize the data each thread writes to and merge the changes at the end.
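
As an illustration (not from the original answer), here is a minimal sketch of that "localize then merge" idea applied to the two-way split above; it reuses the calc(f, l, p, a) method and the a array from the example, and the names p1, p2, and total are just illustrative:

// Each task gets its own counter array, so there are no shared writes in the hot loop.
int l = a.Length;
int[] p1 = new int[11];   // calc writes indices 1..10
int[] p2 = new int[11];

Parallel.Invoke(
    () => calc(0, l / 2, p1, a),    // first half counts into p1 only
    () => calc(l / 2, l, p2, a));   // second half counts into p2 only

// Merge the per-task counts once both tasks have finished.
int[] total = new int[11];
for (int digits = 1; digits <= 10; digits++)
    total[digits] = p1[digits] + p2[digits];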

Upvotes: 4

Viktor Peller

Reputation: 646

  • This code will not give you the right numbers, because it increments the same variables from multiple threads without synchronization. When different CPU cores work on the same variable, every core has its own cached copy, and a modification of that copy does not propagate to the other caches immediately. Because of that, the other cores keep working on the old value. For example, one core might have incremented p[1] from 0 to 1, but the other core still believes it is 0, so when it increments it, the value becomes 1 again. Later this 1 ends up in main memory instead of 2.
  • To answer your question: the problem is that you use the same block of memory from two threads, and that slows down the memory access. Data is normally served from the cache, but when one core writes to the memory area, the other cores sooner or later detect this and have to reload it from main memory, which is slow. (And the "sooner or later" matters to you: the update does not become visible immediately, so you need synchronization, which makes everything even slower when you do it right; a sketch of that synchronized approach follows this list.) Because of these re-fetches, the multithreaded version is slower.
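
For completeness, here is a minimal sketch (not part of the original answer) of the "synchronize it properly" option mentioned above, using Interlocked.Increment on the shared counter array; it assumes the question's a and p fields (with p sized for indices 1..10) and requires using System.Threading:

// Correct but typically slow: every increment is an atomic read-modify-write
// on the shared p array, so the cores still contend for the same cache lines.
void calcSynchronized(int f, int l, int[] p)
{
    for (int i = f; i < l; i++)
    {
        int digits =
            a[i] >= 1000000000 ? 10 :
            a[i] >= 100000000 ? 9 :
            a[i] >= 10000000 ? 8 :
            a[i] >= 1000000 ? 7 :
            a[i] >= 100000 ? 6 :
            a[i] >= 10000 ? 5 :
            a[i] >= 1000 ? 4 :
            a[i] >= 100 ? 3 :
            a[i] >= 10 ? 2 : 1;
        Interlocked.Increment(ref p[digits]);   // atomic, visible to all cores
    }
}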

When you make an algorithm multithreaded, try to split the work so that the threads do not use shared memory. As a kind of micro-optimization - which is BAD - you can also try to allocate each thread's data so that the blocks are not next to each other in memory; otherwise the cache problem mentioned before will still slow down the processing.
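
As an illustration of splitting the work so that threads do not share memory (a sketch, not part of the original answer), the TPL can manage a private counter array per worker for you via the Parallel.For overload with localInit and localFinally; the array name a follows the question, everything else here is illustrative:

// Requires: using System.Threading.Tasks;
int[] totals = new int[11];          // final result, indices 1..10
object mergeLock = new object();

Parallel.For(0, a.Length,
    () => new int[11],               // localInit: a private counter array per worker
    (i, state, local) =>
    {
        int value = a[i];
        int digits = 1;
        while (value >= 10) { value /= 10; digits++; }
        local[digits]++;             // no shared writes inside the hot loop
        return local;
    },
    local =>                         // localFinally: merge each worker's counts once
    {
        lock (mergeLock)
            for (int d = 1; d <= 10; d++)
                totals[d] += local[d];
    });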

Upvotes: 1
