Reputation: 1425
I have a 3-D vector F that is currently filled sequentially like this:
// for each particle
for (int p = 0; p < pmax; p++) {
    // and each dimension (n x m x o)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            for (int k = 0; k < o; k++) {
                // do some calculations depending on p, i, j and k and add onto F
                F[i][j][k] += p * i * j * k;
            }
        }
    }
}
The value of F is calculated incrementally for each particle.
Now I wanted to speed things up using OpenCL. My first attempt was to parallelize the inner loops with a 3D-range kernel that gets called once per particle. That works correctly, but it is actually slower on the GPU than on the CPU (admittedly without applying any optimizations - e.g. F is copied from host to GPU and back again in every iteration).
In my second attempt I tried to parallelize the outer loop, which I think is the better idea, since pmax is expected to be considerably larger than n * m * o. From what I have read, this looks like a job for a parallel sum reduction, of which there are plenty of examples. However, for that approach each thread (or work-item) needs its own copy of F, which does not fit in memory (I get CL_OUT_OF_RESOURCES here).
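A rough back-of-the-envelope calculation (with made-up sizes, since my actual ones vary) shows why one copy of F per work-item blows up:

```python
# Rough estimate (hypothetical sizes) of why one copy of F per
# work-item exhausts GPU memory when parallelizing over particles.
n = m = o = 64                     # grid dimensions (assumed)
pmax = 100000                      # particles (assumed, pmax >> n*m*o)
bytes_per_copy = n * m * o * 4     # one int32/float32 copy of F
total_bytes = pmax * bytes_per_copy
print(total_bytes / 2**30, "GiB")  # far beyond any GPU's memory
```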
My question now is: is it even possible to parallelize such an incremental sum without storing F in memory once per particle? If yes, what would a good approach look like? Or should I rather stick with my first attempt and try to optimize it?
Note that I'm new to OpenCL and don't have much knowledge of parallelization techniques in general. I'd appreciate any hints or references to helpful lectures or examples, thanks!
By the way, my googling sessions on this topic only led me to the calculation of a prefix sum.
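(As I understand it, a prefix sum keeps every partial sum while a reduction collapses everything into a single value - a quick NumPy illustration of the difference:)

```python
import numpy as np

a = np.arange(5)        # [0, 1, 2, 3, 4]
total = np.sum(a)       # reduction: one value
partial = np.cumsum(a)  # prefix sum (scan): all running totals
print(total, partial)   # 10 [ 0  1  3  6 10]
```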
Upvotes: 0
Views: 179
Reputation: 8484
You could start from a simple 3-dimensional kernel like this:
import numpy as np
import pyopencl as cl
import time
m=32
n=32
o=32
pmax=16
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
t1=time.time()
F = np.zeros(m * n * o, dtype=np.int32)
mf = cl.mem_flags
F_buf = cl.Buffer(ctx, mf.WRITE_ONLY, F.nbytes)
prg = cl.Program(ctx, """
__kernel void calc(int pmax, __global int *F) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    int k = get_global_id(2);
    int n = get_global_size(1);
    int o = get_global_size(2);
    // the particle loop moves inside the work-item,
    // so no per-particle copy of F is needed
    int tmp = i * j * k;
    int sum = 0;
    for (int p = 0; p < pmax; p++)
        sum += p * tmp;
    F[i*n*o + j*o + k] = sum;  // C-order index into the (m,n,o) array
}
""").build()
prg.calc(queue, (m,n,o), None, np.int32(pmax), F_buf)
cl.enqueue_copy(queue, F, F_buf)
F = F.reshape((m, n, o))  # kernel writes C-order: F[i*n*o + j*o + k]
t2=time.time()
t3=time.time()
Ftest = np.zeros((m, n, o), dtype=np.int32)
for p in range(pmax):
    for i in range(m):
        for j in range(n):
            for k in range(o):
                Ftest[i][j][k] += p*i*j*k
t4=time.time()
print "OpenCL time:", (t2-t1)
print "CPU time:", (t4-t3)
Test results:
$ python test.py
Choose platform:
[0] <pyopencl.Platform 'Intel(R) OpenCL HD Graphics' at 0x557007fed680>
[1] <pyopencl.Platform 'Portable Computing Language' at 0x7fab67ff0020>
Choice [0]:
Set the environment variable PYOPENCL_CTX='' to avoid being asked again.
OpenCL time: 0.0124819278717
CPU time: 1.03352808952
$ python test.py
Choose platform:
[0] <pyopencl.Platform 'Intel(R) OpenCL HD Graphics' at 0x55a2650505a0>
[1] <pyopencl.Platform 'Portable Computing Language' at 0x7fd80775d020>
Choice [0]:1
Set the environment variable PYOPENCL_CTX='1' to avoid being asked again.
OpenCL time: 0.0148649215698
CPU time: 1.11784911156
Depending on the size of the matrix the performance may differ, and this does not necessarily mean it will be faster than your current CPU implementation. In OpenCL, performance depends on many factors: how much data has to be transferred to the device, whether there is enough computation for that transfer to pay off, and how the data is accessed in the kernel - in which order, by which work-items, and in which type of memory (global, local, registers) - and so on.
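The reason moving the particle loop inside the kernel works for this update rule is that each element's sum over p is independent of every other element's; a small NumPy sketch (with hypothetical sizes) of that equivalence:

```python
import numpy as np

# Hypothetical small sizes for illustration
m = n = o = 4
pmax = 8

# Sequential form, as in the question: accumulate per particle
F_loop = np.zeros((m, n, o), dtype=np.int64)
for p in range(pmax):
    for i in range(m):
        for j in range(n):
            for k in range(o):
                F_loop[i, j, k] += p * i * j * k

# Per-element form, as in the kernel: each (i, j, k) sums over p itself
i = np.arange(m).reshape(m, 1, 1)
j = np.arange(n).reshape(1, n, 1)
k = np.arange(o).reshape(1, 1, o)
F_elem = i * j * k * sum(range(pmax))  # sum_p p * (i*j*k)

assert np.array_equal(F_loop, F_elem)
```

No per-particle copy of F is ever materialized; each element only needs its own running sum, which lives in a register.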
Upvotes: 1