Reputation: 765
For the past two years I have been developing a library, cyme, that performs SIMD computation over "friendly containers". I am able to reach the peak performance of the processor. Typically the user defines a container and writes a kernel with the following syntax (trivial example):
for(i...)
W[i] = R[i]+R[i]+R[i]+R[i]+R[i];
R[i]+R[i]+ ... performs the operations using SIMD registers. I have precise control over the generated asm (using expression templates). I am fully satisfied with it; however, I have been exploring the Power architecture for a few days. The Power7 processor has 4 floating-point units and one vector unit (from Wikipedia: "The POWER7 processor has an Instruction Sequence Unit that is capable of dispatching up to six instructions per cycle to a set of queues").
My idea was to generate asm combining serial (scalar) and vector instructions, so that I might be able to use all 5 units simultaneously. I did it, and this is where my problem starts:
The first asm version of the previous code, pure Power SIMD, is:
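To make the expression-template idea concrete, here is a minimal scalar sketch in C++. It is not cyme's API: `Leaf`, `Sum`, `assign` and `demo` are hypothetical names chosen for illustration, and the element type is a plain `double` rather than a SIMD register. It only shows how `R[i]+R[i]+...` can be recorded as a type and evaluated in one fused loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; this is NOT cyme's API.

// Leaf node: wraps a reference to an input array.
struct Leaf {
    const std::vector<double>& data;
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }
};

// Interior node: represents lhs + rhs without evaluating it yet.
template <class L, class R>
struct Sum {
    L lhs;
    R rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

// operator+ only records its operands; no arithmetic happens here.
inline Sum<Leaf, Leaf> operator+(Leaf a, Leaf b) { return {a, b}; }

template <class L, class R>
Sum<Sum<L, R>, Leaf> operator+(Sum<L, R> a, Leaf b) { return {a, b}; }

// Evaluation: a single loop over the output, no temporary arrays.
template <class Expr>
void assign(std::vector<double>& w, const Expr& e) {
    assert(w.size() == e.size());
    for (std::size_t i = 0; i < w.size(); ++i) w[i] = e[i];
}

// The kernel from the question: W[i] = R[i]+R[i]+R[i]+R[i]+R[i].
inline std::vector<double> demo(const std::vector<double>& R) {
    std::vector<double> W(R.size());
    Leaf r{R};
    assign(W, r + r + r + r + r);
    return W;
}
```

Because the whole right-hand side is one type, a real library can specialize the evaluation step to emit the instruction sequence it wants; here the compiler is simply left to inline the scalar additions.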
.L536:
lxvd2x 0,0,9
stxvd2x 0,1,31
lxvd2x 12,0,9
stxvd2x 12,1,30
xvadddp 0,0,12
lxvd2x 12,0,9
xvadddp 0,0,12
xvadddp 0,0,12
xvadddp 0,0,12
stxvd2x 0,0,9
addi 9,9,176
cmpld 7,28,9
bne 7,.L536
The "nice" hybrid serial/SIMD version (the loop does fewer iterations) is:
.L547:
std 31,128(1)
std 31,136(1)
lfd 12,24(9)
stxvd2x 63,1,30
lfd 11,16(9)
fadd 10,12,12
fadd 9,11,11
fadd 10,10,12
fadd 9,9,11
fadd 10,10,12
fadd 9,9,11
lxvd2x 0,0,9
std 31,480(1)
std 31,488(1)
stfd 11,128(1)
stfd 12,136(1)
stxvd2x 63,1,29
stxvd2x 0,1,30
fadd 10,10,12
fadd 9,9,11
stfd 10,24(9)
stfd 9,16(9)
lxvd2x 10,0,9
stfd 11,480(1)
stfd 12,488(1)
stxvd2x 10,1,29
xvadddp 0,0,10
lxvd2x 12,0,9
xvadddp 0,0,12
xvadddp 0,0,12
xvadddp 0,0,12
stxvd2x 0,0,9
addi 9,9,352
cmpld 7,28,9
bne 7,.L547
The benchmark (one thread; should I use two?) of the first code takes 0.2 s, whereas the hybrid version takes 0.25 s. My knowledge of processor architecture is too limited to understand why the hybrid version is slower.
Generating assembly that mixes vector and serial instructions seemed like a charming idea, so if anybody has a suggestion: is it possible or not?
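For anyone wanting to reproduce a comparison of this kind, a minimal timing sketch in portable C++ follows. It is an assumption about how such a measurement could be set up, not the harness used for the numbers above: `kernel` and `best_time` are hypothetical names, and the kernel is plain C++ rather than hand-generated asm. Taking the best of several runs reduces the influence of noise from the OS.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for the kernel in the question, written in plain C++
// so that the compiler chooses the instructions.
inline void kernel(std::vector<double>& w, const std::vector<double>& r) {
    assert(w.size() == r.size());
    for (std::size_t i = 0; i < w.size(); ++i)
        w[i] = r[i] + r[i] + r[i] + r[i] + r[i];
}

// Runs f() `repeats` times and returns the shortest wall-clock time in seconds.
template <class F>
double best_time(F&& f, int repeats) {
    double best = std::numeric_limits<double>::infinity();
    for (int k = 0; k < repeats; ++k) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}
```

Usage would look like `double t = best_time([&] { kernel(w, r); }, 10);`, called once per variant on the same input so that the two times are directly comparable.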
Best,
++t
PS1: an unrolled SIMD version should be faster; I know, and I did it, but I am now focusing on this hybrid version.
PS2: gcc 4.9.1, IBM Power7, 8205-E6C
Upvotes: 4
Views: 168
Reputation: 1062
I don't have any hands-on experience with these, but according to this PDF, it sounds like the POWER7 merged the previously separate scalar and vector floating-point units to save die space. If that is accurate, interleaving scalar and vector instructions won't achieve any parallelism beyond what the vectorized instructions already provide.
From the abstract:
Unlike previous PowerPC designs, the POWER7 FPU merges the scalar and vector FPUs into a single unit executing three floating-point instruction sets
Do you have access to a POWER6 to test your interleaved code? I would be interested to see how that goes.
Upvotes: 2