Reputation: 11920
Problem:
In OpenCL 1.2, there is no built-in function like
long sum_int4_elements_into_long(int4 v);
What I've tried:
So I'm using the code below (the prefetched inner part of an unrolled loop):
// acc is long
int4 tmp=arr[i+7];
acc+=tmp.x+tmp.y+tmp.z+tmp.w;
tmp=arr[i+6];
acc+=tmp.x+tmp.y+tmp.z+tmp.w;
tmp=arr[i+5];
acc+=tmp.x+tmp.y+tmp.z+tmp.w;
tmp=arr[i+4];
acc+=tmp.x+tmp.y+tmp.z+tmp.w;
tmp=arr[i+3];
acc+=tmp.x+tmp.y+tmp.z+tmp.w;
tmp=arr[i+2];
acc+=tmp.x+tmp.y+tmp.z+tmp.w;
tmp=arr[i+1];
acc+=tmp.x+tmp.y+tmp.z+tmp.w;
tmp=arr[i];
acc+=tmp.x+tmp.y+tmp.z+tmp.w;
to sum (reduce) all elements of an integer array (int4 arr) into a single long variable, with a speed-up of only +20% to +30% compared to serial code. If it could engage SSE or AVX, it would be much faster.
Also tried:
Using a pure int accumulator speeds the summation up by 3x, but the int overflows, so I can only use a long variable. Then I tried using long4 and long2 accumulators, as in:
// acc is a long4, arr is an int4 array
acc+=convert_long4(arr[...])
// acc is a long2, arr is an int2 array
acc+=convert_long2(arr[...])
but it failed and locked up the computer (I checked indexing and limits; there is no problem there), so there must be a problem in the intn-to-longn hardware instruction mapping on this AMD CPU.
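A speculative workaround sketch, in case the crash comes from the packed int4-to-long4 conversion itself: widen each lane through scalar casts in a vector literal instead (acc4 is a hypothetical long4 accumulator; this is untested guesswork, not a confirmed fix).
// Widen per component instead of using convert_long4, in case the
// packed conversion instruction is what the driver mishandles.
int4 tmp = arr[i];
acc4 += (long4)((long)tmp.x, (long)tmp.y, (long)tmp.z, (long)tmp.w);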
Clarification:
There must be some equivalent SIMD (SSE or AVX) instruction on AMD and Intel CPUs for
     32-bit integers in a 128-bit SSE register
           |       |       |       |
acc  +=  tmp.x + tmp.y + tmp.z + tmp.w
 ^
 |
64-bit register
but somehow OpenCL doesn't have a mapping to this, or I did not arrange the C99 code well enough, so the CL compiler couldn't use SSE/AVX instructions.
The closest built-in is the float version:
acc=dot(v,(float4)(1,1,1,1));
but I need an integer version of this, because floating point needs Kahan summation for correctness, which takes extra time.
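For reference, the Kahan correction mentioned here looks like this (a minimal sketch; input and n are placeholder names, not from the original code):
// Minimal Kahan (compensated) summation: c carries the rounding error
// lost by each addition and feeds it back into the next one.
float acc = 0.0f, c = 0.0f;
for (int i = 0; i < n; ++i) {
    float y = input[i] - c; // re-apply the previously lost low-order bits
    float t = acc + y;      // big + small: low-order bits of y round off
    c = (t - acc) - y;      // algebraically zero; in fp it captures the loss
    acc = t;
}
The two extra subtractions per element are exactly the extra time referred to above.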
Edit:
I'm not sure whether int+int+int+int will produce a proper long result, or just an int that has already overflowed and is then converted to a long.
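(For what it's worth, C99/OpenCL C evaluates int+int in int, so the components can wrap before the result ever reaches the long accumulator. A minimal sketch of forcing the whole chain into 64-bit arithmetic:)
// Casting the first operand to long makes the usual arithmetic
// conversions widen the remaining additions to 64 bits too, so the
// component sum can no longer wrap in int.
int4 tmp = arr[i];
acc += (long)tmp.x + tmp.y + tmp.z + tmp.w;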
OpenCL version: 1.2, running on the CPU (Java → JOCL)
CPU: AMD FX-8150
OS: 64-bit Windows 10
Driver: latest one by AMD.
Date: 23.09.2015
For comparison:
16M 32-bit integers, FX-8150 @ 3300 MHz (using 6 cores out of 8)
Serial code on Java 1.8_25 takes 16.5 ms on average.
IntStream of Java 1.8 takes 13.5 ms on average (X.reduce(0, Integer::sum)).
Parallel code in this question takes 12.5 ms on average (using a single workgroup).
Parallel code in this question takes 5.8 ms on average (using four workgroups).
The parallel but overflowing non-long version takes 5 ms (hitting memory bandwidth).
mfa's answer:
acc = dot(convert_double4(v), (double4)(1,1,1,1));
takes 13.5 ms on average, but the float version takes 12.2 ms on average.
I'm not sure whether a float can always keep enough precision to add 1.0 (or even 0.0) exactly to a very big fp number.
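A two-line illustration of that worry (2^24 is the first point where a float can no longer absorb +1.0 exactly; adding 0.0 is always exact):
float big = 16777216.0f;        // 2^24
int lost = (big + 1.0f == big); // 1 (true): the +1.0f is rounded away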
Upvotes: 3
Views: 4371
Reputation: 8410
What is the speed of doing the reduction this way? Maybe it is not so bad after all.
long4 lv = convert_long4(v); // vector casts like (long4)v are not valid OpenCL C
long2 t = lv.xy + lv.zw;
acc += t.x + t.y;
Also, if what you really want is to reduce many items, not a single int4, then sum them in long4 space and reduce only the final vector:
long4 sums = (long4)(0);
sums += convert_long4(arr[0]);
sums += convert_long4(arr[1]);
sums += convert_long4(arr[2]);
...
sums += convert_long4(arr[N-1]);
sums += convert_long4(arr[N]);
long2 t = sums.xy + sums.zw;
long res = t.x + t.y;
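Put together, a self-contained kernel sketch of this approach could look as follows (the kernel name, partialSums and itemsPerWorkItem are illustrative, not from the original; each work-item reduces one chunk, and the host adds the few partial results at the end):
kernel void reduce_int4_to_long(global const int4 *arr,
                                global long *partialSums,
                                const int itemsPerWorkItem)
{
    const int gid = get_global_id(0);
    const int base = gid * itemsPerWorkItem;

    // Accumulate in long4 space so no intermediate sum can overflow.
    long4 sums = (long4)(0);
    for (int i = 0; i < itemsPerWorkItem; ++i)
        sums += convert_long4(arr[base + i]);

    // Reduce the final vector only once per work-item.
    long2 t = sums.xy + sums.zw;
    partialSums[gid] = t.x + t.y;
}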
NOTE: If this is the only operation you are doing, memory bottlenecks are likely to be the main problem here, so measuring the kernel execution time alone is going to give a highly biased result.
Upvotes: 3
Reputation: 5087
The sum of four ints will always fit into a double-precision float. Have you given this a try?
double acc;
acc = dot(convert_double4(v), (double4)(1,1,1,1));
Could you post the timing for this too please?
EDIT: adding more info.
The double version took 13.2 ms on average while the float version took 12.2 ms on average, but I'm not sure whether float addition always preserves the integers' quantum steps. Could its precision at big floats be enough for it to add 1.0 or 0.0 exactly?
The extra precision could definitely account for the extra 1 ms. On some older AMD GPU hardware, double operations effectively take twice as long because they use two float registers together. The slight slowdown you are measuring is also plausible when you consider that, mathematically speaking, a double-precision operation can be up to 8 individual single-precision ops combined. Overall, I think your CPU is doing a decent job with doubles relative to floats.
Single precision stays exact as long as your sum fits within 24 bits. More about this here. Doubles give 53 bits of integer precision (here). Perhaps it is worth having a separate 'small' kernel for when you know the sum will be small?
Upvotes: 1