Vivilar

Reputation: 97

Efficiently multiply OpenCL vector components?

I have a float8 vector whose components I multiply together using vector component addressing, as follows (note that the variable v below isn't a constant in reality):

float8 v = (float8) (1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f);
float result = v.s0 * v.s1 * v.s2 * v.s3 * v.s4 * v.s5 * v.s6 * v.s7;

However, this prevents my kernel from being vectorised when it is compiled with Intel Code Builder:

Device build started
Device build done
Kernel <test> was not vectorized

To overcome this I started creating copies of the vector, masking the required components and multiplying them all together before calling the dot function; however, this all seemed rather inefficient and convoluted.

My question is therefore: how can I multiply the components of my vector in an efficient, vectorised manner?

Upvotes: 0

Views: 977

Answers (1)

huseyin tugrul buyukisik

Reputation: 11920

My comment was wrong, as it is not a dot product you need in the result; it is simply a multiplication of 8 numbers. Data that is processed in parallel should be laid out in parallel, not packed into the same container. If you want to multiply s0, s1, s2, ..., s7, put them in consecutive vector variables, so that each variable holds the same component of several different products:

variable-1:  s0 p0 r0 q0 .... z0
variable-2:  s1 p1 r1 q1 .... z1

variable-8:  s7 p7 ....       z7

You can then multiply those at SIMD speed, performing 8 multiplications at a time with the float8 type, and continue as many times as you need, not just 8.
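To make the layout above concrete, here is a minimal sketch in plain host C rather than kernel code. The function name `multiply_columns`, the `FACTORS` constant and the flat row-major array are my own illustrative choices, not anything from Intel's tools. Row k plays the role of "variable-(k+1)", and every lane computes an independent product, which is the shape a vectoriser tends to handle well.

```c
#include <assert.h>
#include <stddef.h>

#define FACTORS 8  /* s0..s7 of the original float8 (assumed fixed at 8) */

/* factors is row-major: factors[k * count + i] is component k of product i,
   so row k corresponds to "variable-(k+1)" in the layout above.
   out[i] receives the product of the 8 components of product i. */
void multiply_columns(const float *factors, float *out, size_t count)
{
    assert(factors != NULL && out != NULL);

    /* Start from row 0 (s0, p0, r0, ...). */
    for (size_t i = 0; i < count; ++i)
        out[i] = factors[i];

    /* Multiply in the remaining rows. The inner loop has no dependency
       between lanes, so each iteration can map onto one SIMD lane. */
    for (size_t k = 1; k < FACTORS; ++k)
        for (size_t i = 0; i < count; ++i)
            out[i] *= factors[k * count + i];
}
```

With count equal to 8, each row would fit one float8 in an OpenCL kernel, and the inner loop would become a single vector multiply per row.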

At each multiplication, you are responsible for checking for errors and overflows. But when the hardware does 8 multiplications in a single instruction, which order do you want? Do you want them multiplied in increasing index order (serial, slow), or something like a pairwise multiplication over a tree of elements (the same number of multiplications but fewer dependent steps, so faster, though it can give different results)? The order of operations can matter.
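The two orders can be sketched like this, again in plain host C with hypothetical function names (`product_serial`, `product_pairwise`). Both perform 7 multiplications, but the serial version is a chain of 7 dependent steps while the tree has only 3 levels. Since floating-point multiplication is not associative, the two can round or overflow differently for some inputs.

```c
#include <assert.h>
#include <stddef.h>

/* Serial order: ((((((s0*s1)*s2)*s3)*s4)*s5)*s6)*s7.
   Each multiply depends on the previous one. */
float product_serial(const float v[8])
{
    assert(v != NULL);
    float acc = v[0];
    for (size_t i = 1; i < 8; ++i)
        acc *= v[i];
    return acc;
}

/* Pairwise (tree) order: lower half times upper half, repeated.
   Still 7 multiplies in total, but only 3 dependent levels. */
float product_pairwise(const float v[8])
{
    assert(v != NULL);
    float a[4], b[2];
    for (size_t i = 0; i < 4; ++i)   /* level 1: 4 independent multiplies */
        a[i] = v[i] * v[i + 4];
    for (size_t i = 0; i < 2; ++i)   /* level 2: 2 independent multiplies */
        b[i] = a[i] * a[i + 2];
    return b[0] * b[1];              /* level 3 */
}
```

In OpenCL C the same tree can be written with the .lo and .hi selectors (for a float8 v, v.lo * v.hi yields a float4, and so on down to a scalar). I have not checked whether Intel's compiler then reports the kernel as vectorised.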

If it is a GPU, simply multiply the items; instruction-level parallelism plus the GPU's hardware multithreading achieves efficiency. If it is a CPU, you should first check whether your CPU supports instructions that multiply the components within a single vector (I doubt such a thing exists); if not, you need to multiply across array elements, not vector elements. This should be easier to vectorise, as the data is contiguous in main memory, since a CPU does not give explicit control over local memory.

Upvotes: 1
