Reputation: 11920
Problem:
In OpenCL 1.2, there is no built-in function like
long sum_int4_elements_into_long(int4 v);
What I've tried:
So I'm using the code below (the prefetched inner part of an unrolled loop):
// acc is long
int4 tmp=arr[i+7];
acc+=tmp.x+tmp.y+tmp.z+tmp.w;
tmp=arr[i+6];
acc+=tmp.x+tmp.y+tmp.z+tmp.w;
tmp=arr[i+5];
acc+=tmp.x+tmp.y+tmp.z+tmp.w;
tmp=arr[i+4];
acc+=tmp.x+tmp.y+tmp.z+tmp.w;
tmp=arr[i+3];
acc+=tmp.x+tmp.y+tmp.z+tmp.w;
tmp=arr[i+2];
acc+=tmp.x+tmp.y+tmp.z+tmp.w;
tmp=arr[i+1];
acc+=tmp.x+tmp.y+tmp.z+tmp.w;
tmp=arr[i];
acc+=tmp.x+tmp.y+tmp.z+tmp.w;
to sum (reduce) all elements of an integer array (int4 arr) into a single long variable, with a speed-up of only +20% to +30% compared to serial code. If it could engage SSE or AVX, it would be much faster.
Also tried:
Using a pure int accumulator speeds the summation up by 3x, but the int overflows, so I can only use a long variable. Then I tried using long4 and long2 accumulators, as in:
// acc is a long4, arr is an int4 array
acc+=convert_long4(arr[...])
// acc is a long2, arr is an int2 array
acc+=convert_long2(arr[...])
but it failed and locked up the computer (I checked indexing and limits; there is no problem there), so there must be a problem in the intn-to-longn hardware instruction mapping on this AMD CPU.
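A speculative workaround sketch, in case the crash comes from the packed int4-to-long4 conversion itself: widen each lane through scalar casts in a vector literal instead (acc4 is a hypothetical long4 accumulator; this is untested guesswork, not a confirmed fix).
// Widen per component instead of using convert_long4, in case the
// packed conversion instruction is what the driver mishandles.
int4 tmp = arr[i];
acc4 += (long4)((long)tmp.x, (long)tmp.y, (long)tmp.z, (long)tmp.w);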
Clarification:
There must be some equivalent SIMD (SSE or AVX) instruction on AMD and Intel CPUs for
     32-bit integers in a 128-bit SSE register
           |       |       |       |
acc  +=  tmp.x + tmp.y + tmp.z + tmp.w
 ^
 |
64-bit register
but somehow OpenCL doesn't have a mapping to this, or I did not arrange the C99 code well enough, so the CL compiler couldn't use SSE/AVX instructions.
The closest built-in is the float version:
acc=dot(v,(float4)(1,1,1,1));
but I need an integer version of this, because floating point needs Kahan summation for correctness, which takes extra time.
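For reference, the Kahan correction mentioned here looks like this (a minimal sketch; input and n are placeholder names, not from the original code):
// Minimal Kahan (compensated) summation: c carries the rounding error
// lost by each addition and feeds it back into the next one.
float acc = 0.0f, c = 0.0f;
for (int i = 0; i < n; ++i) {
    float y = input[i] - c; // re-apply the previously lost low-order bits
    float t = acc + y;      // big + small: low-order bits of y round off
    c = (t - acc) - y;      // algebraically zero; in fp it captures the loss
    acc = t;
}
The two extra subtractions per element are exactly the extra time referred to above.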
Edit:
I'm not sure whether int+int+int+int will produce a proper long result, or just an int that has already overflowed and is then converted to a long.
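(For what it's worth, C99/OpenCL C evaluates int+int in int, so the components can wrap before the result ever reaches the long accumulator. A minimal sketch of forcing the whole chain into 64-bit arithmetic:)
// Casting the first operand to long makes the usual arithmetic
// conversions widen the remaining additions to 64 bits too, so the
// component sum can no longer wrap in int.
int4 tmp = arr[i];
acc += (long)tmp.x + tmp.y + tmp.z + tmp.w;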
OpenCL version: 1.2, running on the CPU (Java → JOCL)
CPU: AMD FX-8150
OS: 64-bit Windows 10
Driver: latest one by AMD.
Date: 23.09.2015
For comparison:
16M 32-bit integers, FX-8150 @ 3300 MHz (using 6 cores out of 8)
Serial code on Java 1.8_25 takes 16.5 ms on average.
IntStream of Java 1.8 takes 13.5 ms on average (X.reduce(0, Integer::sum)).
Parallel code in this question takes 12.5 ms on average (using a single workgroup).
Parallel code in this question takes 5.8 ms on average (using four workgroups).
The parallel but overflowing non-long version takes 5 ms (hitting memory bandwidth).
mfa's answer:
acc = dot(convert_double4(v), (double4)(1,1,1,1));
takes 13.5 ms on average, but the float version takes 12.2 ms on average.
I'm not sure whether a float can always keep enough precision to add 1.0 (or even 0.0) exactly to a very big fp number.
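A two-line illustration of that worry (2^24 is the first point where a float can no longer absorb +1.0 exactly; adding 0.0 is always exact):
float big = 16777216.0f;        // 2^24
int lost = (big + 1.0f == big); // 1 (true): the +1.0f is rounded away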
Upvotes: 3
Views: 4371
Reputation: 8410
What is the speed of doing the reduction this way? Maybe it is not so bad after all.
long4 lv = convert_long4(v); // vector casts like (long4)v are not valid OpenCL C
long2 t = lv.xy + lv.zw;
acc += t.x + t.y;
Also, if what you really want is to reduce many items, not a single int4, then sum them in long4 space and reduce only the final vector:
long4 sums = (long4)(0);
sums += convert_long4(arr[0]);
sums += convert_long4(arr[1]);
sums += convert_long4(arr[2]);
...
sums += convert_long4(arr[N-1]);
sums += convert_long4(arr[N]);
long2 t = sums.xy + sums.zw;
long res = t.x + t.y;
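Put together, a self-contained kernel sketch of this approach could look as follows (the kernel name, partialSums and itemsPerWorkItem are illustrative, not from the original; each work-item reduces one chunk, and the host adds the few partial results at the end):
kernel void reduce_int4_to_long(global const int4 *arr,
                                global long *partialSums,
                                const int itemsPerWorkItem)
{
    const int gid = get_global_id(0);
    const int base = gid * itemsPerWorkItem;

    // Accumulate in long4 space so no intermediate sum can overflow.
    long4 sums = (long4)(0);
    for (int i = 0; i < itemsPerWorkItem; ++i)
        sums += convert_long4(arr[base + i]);

    // Reduce the final vector only once per work-item.
    long2 t = sums.xy + sums.zw;
    partialSums[gid] = t.x + t.y;
}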
NOTE: If this is the only operation you are doing, memory bottlenecks are likely to be the main problem here, so measuring the kernel execution time alone is going to give a highly biased result.
Upvotes: 3
Reputation: 5087
The sum of four ints will always fit into a double-precision float. Have you given this a try?
double acc;
acc = dot(convert_double4(v), (double4)(1,1,1,1));
Could you post the timing for this too please?
EDIT: adding more info.
The double version took 13.2 ms on average while the float version took 12.2 ms on average, but I'm not sure whether float addition always preserves the integers' quantum steps. Could its precision at big floats be enough for it to add 1.0 or 0.0 exactly?
The extra precision could definitely account for the extra 1 ms. On some older AMD GPU hardware, double operations effectively take twice as long because they use two float registers together. The slight slowdown you are measuring is also plausible when you consider that, mathematically speaking, a double-precision operation can be up to 8 individual single-precision ops combined. Overall, I think your CPU is doing a decent job with doubles relative to floats.
Single precision stays exact as long as your sum fits within 24 bits. More about this here. Doubles give 53 bits of integer precision (here). Perhaps it is worth having a separate 'small' kernel for when you know the sum will be small?
Upvotes: 1