Omegaman

Reputation: 2329

OpenCL: 2-10x run time summing columns rather than rows of a square array?

I'm just getting started with OpenCL, so I'm sure there are a dozen things I could do to improve this code, but one thing in particular stands out to me: if I sum columns rather than rows (basically contiguous versus strided access, since all the buffers are linear) of a 2D array of data, the run times differ by a factor of anywhere from 2 to 10x. Strangely, the contiguous accesses appear to be the slower ones.

I'm using PyOpenCL to test.

Here are the two kernels of interest (reduce and reduce2), plus the one that generates the data feeding them (forcesCL):

kernel void forcesCL(global float4 *chrgs, global float4 *chrgs2,float k, global float4 *frcs)
{
    int i=get_global_id(0);
    int j=get_global_id(1);
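    // linear index into the Gs0 x Gs1 force grid: i varies fastest, so element (i,j) lands at i + Gs0*j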
    int idx=i+get_global_size(0)*j;

    // displacement between the two charges; force magnitude is q_i*q_j / (k*l^2) along the unit vector, or zero if they coincide
    float3 c=chrgs[i].xyz-chrgs2[j].xyz;
    float l=length(c);
    frcs[idx].xyz= (l==0 ? 0 
                         : ((chrgs[i].w*chrgs2[j].w)/(k*pown(l,2)))*normalize(c));
    frcs[idx].w=0;
}

kernel void reduce(global float4 *frcs,ulong k,global float4 *result)
{
    ulong gi=get_global_id(0);
    ulong gs=get_global_size(0);
    float3 tmp=0;

    // work-item gi sums the strided elements gi, gi+gs, gi+2*gs, ...
    for(ulong i=0;i<k;i++)
        tmp+=frcs[gi+i*gs].xyz;
    result[gi].xyz=tmp;
}

kernel void reduce2(global float4 *frcs,ulong k,global float4 *result)
{
    ulong gi=get_global_id(0);
    ulong gs=get_global_size(0);
    float3 tmp=0;

    // work-item gi sums the contiguous elements gi*gs, gi*gs+1, ..., gi*gs+(k-1)
    for(ulong i=0;i<k;i++)
        tmp+=frcs[gi*gs+i].xyz;
    result[gi].xyz=tmp;
}

It's the reduce kernels that are of interest here. The forcesCL kernel just estimates the Coulomb force between two charges, where the position of each is encoded in the xyz components of a float4 and the charge in the w component. The physics isn't important; it's just a toy for playing with OpenCL.

I won't go through the PyOpenCL setup unless asked, other than to show the build step:

program=cl.Program(context,'\n'.join((src_forcesCL,src_reduce,src_reduce2))).build()
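
(The rest of the PyOpenCL setup is just the usual boilerplate, roughly along these lines; the exact device shouldn't matter:)

import numpy as np
import pyopencl as cl

# rough sketch of the omitted setup (not the literal code I use)
context = cl.create_some_context()   # pick a platform/device
queue = cl.CommandQueue(context)     # in-order command queue
# k and q are defined elsewhere as plain np.float32 scalars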

I then set up the arrays with random positions and the elementary charge:

a=np.random.rand(10000,4).astype(np.float32)
a[:,3]=np.float32(q)
b=np.random.rand(10000,4).astype(np.float32)
b[:,3]=np.float32(q)

Then I set up scratch space and allocate buffers for the return values:

c=np.empty((10000,10000,4),dtype=np.float32)
cc=cl.Buffer(context,cl.mem_flags.READ_WRITE,c.nbytes)
r=np.empty((10000,4),dtype=np.float32)
rr=cl.Buffer(context,cl.mem_flags.WRITE_ONLY,r.nbytes)
s=np.empty((10000,4),dtype=np.float32)
ss=cl.Buffer(context,cl.mem_flags.WRITE_ONLY,s.nbytes)
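
(Note that the scratch buffer is fairly large; as a quick check:)

print(c.nbytes / 1e9)   # the 10000 x 10000 x 4 float32 scratch array is about 1.6 GB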

Then I run this in each of two ways: once using reduce(), and once using reduce2(). The only difference should be which axis I'm summing over:

%%timeit
aa=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=a)
bb=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=b)

evt1=program.forcesCL(queue,c.shape[0:2],None,aa,bb,k,cc)
evt2=program.reduce(queue,r.shape[0:1],None,cc,np.uint64(b.shape[0]),rr,wait_for=[evt1])
evt4=cl.enqueue_copy(queue,r,rr,wait_for=[evt2],is_blocking=True)

Notice I've swapped the arguments to forcesCL so I can check the results against the first method:

%%timeit
aa=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=a)
bb=cl.Buffer(context,cl.mem_flags.READ_ONLY|cl.mem_flags.COPY_HOST_PTR,hostbuf=b)

evt1=program.forcesCL(queue,c.shape[0:2],None,bb,aa,k,cc)
evt2=program.reduce2(queue,s.shape[0:1],None,cc,np.uint64(a.shape[0]),ss,wait_for=[evt1])
evt4=cl.enqueue_copy(queue,s,ss,wait_for=[evt2],is_blocking=True)

The version using the reduce() kernel gives me times on the order of 140 ms, while the version using the reduce2() kernel gives times on the order of 360 ms. The values returned are the same, save for a sign change, because the positions are subtracted in the opposite order.
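
(A quick host-side comparison along these lines confirms that:)

# compare only the xyz components; the w component is never written by the reduce kernels
print(np.allclose(r[:, :3], -s[:, :3], rtol=1e-4, atol=1e-7))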

If I do the forcesCL step once and then run only the two reduce kernels, the difference is much more pronounced: on the order of 30 ms versus 250 ms.
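
(One way to get per-kernel numbers like these is to enable profiling on the queue and read the event timestamps, roughly:)

# hypothetical profiling variant: requires a queue created with profiling enabled,
# and assumes cc has already been filled by a forcesCL run
pqueue = cl.CommandQueue(context, properties=cl.command_queue_properties.PROFILING_ENABLE)
evt = program.reduce(pqueue, r.shape[0:1], None, cc, np.uint64(b.shape[0]), rr)
evt.wait()
print((evt.profile.end - evt.profile.start) * 1e-6, "ms")   # device-side kernel time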

I wasn't expecting any difference, but if anything I would have expected the contiguous accesses to perform better, not worse.

Can anyone give me some insight into what's happening here?

Thanks!

Upvotes: 0

Views: 472

Answers (1)

DarkZeros

Reputation: 8410

This is a clear case of memory coalescing. It is not about how the index is used (over rows or columns), but about how the memory is actually accessed by the hardware. It is just a matter of following, step by step, how the real accesses are performed and in which order.

Let's analyze it properly:

Suppose the work-items are divided into local blocks of size N.

For the first case:

WI_0 will read 0, Gs, 2Gs, 3Gs, .... (k-1)Gs
WI_1 will read 1, Gs+1, 2Gs+1, 3Gs+1, .... (k-1)Gs+1
...

Since these work-items run in parallel, their memory accesses occur at the same time. So the memory controller is asked for:

First iteration: 0, 1, 2, 3 ... N-1  -> grouped into a few wide memory accesses
Second iteration: Gs, Gs+1, Gs+2, ... Gs+N-1  -> grouped into a few wide memory accesses
...

In this case, on each iteration, the memory controller groups all N work-item requests into a few big (e.g. 256-bit) reads/writes to global memory. There is no need for the cache, since the data can be discarded after it is processed.

For the second case:

WI_0 will read 0, 1, 2, 3, .... (k-1)
WI_1 will read Gs, Gs+1, Gs+2, Gs+3, .... Gs+(k-1)   
...

The memory controller is asked for:

First iteration: 0, Gs, 2Gs, 3Gs -> scattered reads, no grouping
Second iteration: 1, Gs+1, 2Gs+1, 3Gs+1 -> scattered reads, no grouping
...

In this case, the memory controller is not working efficiently. It would still be fine if the cache were infinite, but it is not. It can serve some of the reads from cache, since the memory requested by different work-items sometimes coincides (due to the k-sized for loop), but not all of them.
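
You can make this concrete with a tiny host-side sketch (toy sizes, purely illustrative) that lists which addresses N lock-stepped work-items request on each iteration of the loop:

N, Gs, k = 8, 16, 4   # toy sizes, purely illustrative

# Case 1 (reduce): work-item gi reads address gi + i*Gs on iteration i
case1 = [[gi + i * Gs for gi in range(N)] for i in range(k)]
# Case 2 (reduce2): work-item gi reads address gi*Gs + i on iteration i
case2 = [[gi * Gs + i for gi in range(N)] for i in range(k)]

print(case1[0])   # [0, 1, 2, ..., 7]       -> one contiguous burst
print(case2[0])   # [0, 16, 32, ..., 112]   -> N scattered locations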


When you reduce the value of k, you reduce the amount of cache reuse that is possible, and this leads to even bigger differences between the column and row access modes.

Upvotes: 4
