Reputation: 215
I am creating a database using C#. The problem is that I have close to 4 million data points, and building the database takes a very long time (possibly several months). The code looks something like this:
int[,,,] Result1 = new int[10, 10, 10, 10];
int[,,,] Result2 = new int[10, 10, 10, 10];
int[,,,] Result3 = new int[10, 10, 10, 10];
int[,,,] Result4 = new int[10, 10, 10, 10];
for (int i = 0; i < 10; i++)
{
    for (int j = 0; j < 10; j++)
    {
        for (int k = 0; k < 10; k++)
        {
            for (int l = 0; l < 10; l++)
            {
                Result1[i, j, k, l] = myFunction1(i, j, k, l);
                Result2[i, j, k, l] = myFunction2(i, j, k, l);
                Result3[i, j, k, l] = myFunction3(i, j, k, l);
                Result4[i, j, k, l] = myFunction4(i, j, k, l);
            }
        }
    }
}
All the elements of the Result arrays are completely independent of each other. My PC has 8 cores, and I have created a thread for each of the myFunction methods, but the whole process still takes very long simply because there are so many cases. I am wondering if there is any way to run this on the GPU rather than the CPU. I have not done this before and I do not know how it would work. I would appreciate it if someone could help me with this.
Upvotes: 2
Views: 2246
Reputation: 61499
I don't think your code example is using all eight cores - only one. The following should use all eight:
private void Para()
{
    // Requires: using System.Threading.Tasks;
    int[,,,] Result1 = new int[10, 10, 10, 10];
    int[,,,] Result2 = new int[10, 10, 10, 10];
    int[,,,] Result3 = new int[10, 10, 10, 10];
    int[,,,] Result4 = new int[10, 10, 10, 10];
    // Use int bounds (0, not 0L), so the loop variables can be passed
    // directly to the myFunctionN(int, int, int, int) methods.
    Parallel.For(0, 10, i =>
    {
        Parallel.For(0, 10, j =>
        {
            Parallel.For(0, 10, k =>
            {
                Parallel.For(0, 10, l =>
                {
                    Result1[i, j, k, l] = myFunction1(i, j, k, l);
                    Result2[i, j, k, l] = myFunction2(i, j, k, l);
                    Result3[i, j, k, l] = myFunction3(i, j, k, l);
                    Result4[i, j, k, l] = myFunction4(i, j, k, l);
                });
            });
        });
    });
}
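As a side note on the design: nesting Parallel.For four levels deep schedules very small pieces of work. A sketch of a flatter variant (my own rearrangement, assuming the same arrays and myFunctionN methods as above) runs one parallel loop over all 10,000 index tuples, which gives the scheduler fewer, larger work items:
// Alternative sketch: a single Parallel.For over all 10 * 10 * 10 * 10
// combinations, decoding i, j, k, l from one flat index.
Parallel.For(0, 10 * 10 * 10 * 10, idx =>
{
    int i = idx / 1000;
    int j = (idx / 100) % 10;
    int k = (idx / 10) % 10;
    int l = idx % 10;
    Result1[i, j, k, l] = myFunction1(i, j, k, l);
    Result2[i, j, k, l] = myFunction2(i, j, k, l);
    Result3[i, j, k, l] = myFunction3(i, j, k, l);
    Result4[i, j, k, l] = myFunction4(i, j, k, l);
});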
If this isn't sufficient, have a look at Cudafy; it should make your life easier than rewriting all your complicated functions in C++.
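To give a feel for the Cudafy pattern: you mark a static method with [Cudafy], translate it to CUDA at run time, and launch it through a GPGPU handle. A minimal sketch under those assumptions - the kernel body is a placeholder, not your actual myFunction1, and the launch sizes are illustrative:
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

public static class GpuExample
{
    [Cudafy]
    public static void Kernel(GThread thread, int[] result)
    {
        // One GPU thread per flat (i, j, k, l) index.
        int gid = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
        if (gid < result.Length)
        {
            int i = gid / 1000, j = (gid / 100) % 10, k = (gid / 10) % 10, l = gid % 10;
            result[gid] = i + j + k + l; // placeholder for a real function body
        }
    }

    public static int[] Run()
    {
        GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
        gpu.LoadModule(CudafyTranslator.Cudafy()); // translate [Cudafy] methods to CUDA

        int[] host = new int[10 * 10 * 10 * 10];
        int[] dev = gpu.Allocate<int>(host.Length);
        gpu.Launch(host.Length / 256 + 1, 256).Kernel(dev); // grid size, block size
        gpu.CopyFromDevice(dev, host);
        gpu.FreeAll();
        return host;
    }
}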
Upvotes: 0
Reputation: 10201
You could consider rewriting this part of your application using C++ AMP and calling it from your .NET code. For more information, see http://blogs.msdn.com/b/nativeconcurrency/archive/2012/08/30/learn-c-amp.aspx
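The .NET side of such a split is typically just a P/Invoke declaration. A sketch, assuming the C++ AMP code is compiled into a hypothetical native DLL named AmpKernels.dll that exports a C-style function (both names are placeholders):
using System.Runtime.InteropServices;

public static class AmpInterop
{
    // Hypothetical export: the native side fills the flat buffer on the GPU.
    [DllImport("AmpKernels.dll", CallingConvention = CallingConvention.Cdecl)]
    private static extern void ComputeResult1(int[] result, int length);

    public static int[] RunResult1()
    {
        var result = new int[10 * 10 * 10 * 10]; // flattened 10x10x10x10 array
        ComputeResult1(result, result.Length);
        return result;
    }
}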
However, in the code you show there are 40,000 data points, not 4,000,000.
There are about 2.6 million seconds in a month. For 40,000 data points, that gives you over a minute per data point. (Even if you did have 4 million data points, it would still be well over half a second per data point.) I don't know what those functions are doing, but I'd be surprised if something that needs to run that long were a good candidate to run on a GPU.
Maybe you need to revisit the algorithms used in those functions to see if they can be optimized. You may even have to reconsider your idea of calculating each data point independently of the others. Are you sure that one result cannot be computed more efficiently if you already know some other results?
UPDATE:
What I mean by this last remark is that there may be repeated calculation going on. For example, if part of the calculation done by myFunction1 depends only on the first two parameters, you could restructure your code as follows:
for (int i = 0; i < 10; i++)
{
    for (int j = 0; j < 10; j++)
    {
        var commonPartValue = commonPart(i, j);
        for (int k = 0; k < 10; k++)
        {
            for (int l = 0; l < 10; l++)
            {
                Result1[i, j, k, l] = myFunction1b(i, j, k, l, commonPartValue);
            }
        }
    }
}
The net effect would be that you calculate this 'common part' once, where you used to do it a hundred times.
Another case is where you can calculate a result more efficiently from the previous result than from scratch. For example, n² can easily be calculated as n * n, but if you already know (n - 1)², then n² = (n - 1)² + 2 * n - 1. In integer arithmetic, this replaces the multiplication with an addition, a shift, and a decrement, which is faster.
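A minimal sketch of that incremental idea:
// Successive squares without a multiplication: n² = (n - 1)² + 2n - 1,
// where 2n is a left shift.
int square = 0; // 0 squared
for (int n = 1; n <= 10; n++)
{
    square += (n << 1) - 1; // add 2n - 1 to turn (n - 1)² into n²
    Console.WriteLine($"{n}^2 = {square}");
}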
Now, I'm not claiming your problem is as simple as these examples, but I am saying that you should look for these kinds of optimizations first, before looking for better compilers or different hardware.
Also, as a side note: I am assuming that you store what you have calculated on disk, not in an array in RAM. I wouldn't want to wait a month for the results to show, and then have a power cut...
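A minimal sketch of that kind of checkpointing (the file name and record layout are just placeholders):
using System.IO;

// Append each finished data point to a binary file, so a crash or power
// cut loses only the work currently in flight. Opening the file once per
// point is simple but slow; batching the writes would be the next step.
static void SaveResult(int i, int j, int k, int l, int value)
{
    using (var writer = new BinaryWriter(
        File.Open("results.bin", FileMode.Append, FileAccess.Write)))
    {
        writer.Write(i);
        writer.Write(j);
        writer.Write(k);
        writer.Write(l);
        writer.Write(value);
    }
}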
Upvotes: 1
Reputation: 5695
Yes, the intuition in this scenario is to use multiple threads or even GPUs to accelerate the computation. But the important thing is to figure out whether the problem is actually suited for parallel computation.
You say these data points are independent of each other, but if the multi-threaded version running on 8 cores shows no obvious improvement, that suggests one of two issues: either the statement about the independence of the data is wrong, or the multi-threaded implementation is not optimized. I would suggest you tune your code first until you see an improvement, and only then look for ways to move it to the GPU.
Alternatively, take a look at OpenCL, which can target both multi-core CPUs and GPU cores. But again, the important thing is to figure out whether your problem is really suited for parallel computing.
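To give a feel for what that involves: OpenCL kernels are written in OpenCL C and compiled at run time, with a C# binding such as Cloo or OpenCL.Net driving the host side. A sketch of what the kernel source for one result array might look like, held as a C# string (the body is a placeholder, not the real myFunction1):
// OpenCL C source as a C# string; a binding library compiles and launches it.
const string KernelSource = @"
__kernel void compute(__global int* result)
{
    int gid = get_global_id(0);   // one work-item per data point
    int i = gid / 1000;
    int j = (gid / 100) % 10;
    int k = (gid / 10) % 10;
    int l = gid % 10;
    result[gid] = i + j + k + l;  // placeholder for myFunction1
}";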
Upvotes: 1