Reputation: 63
I am investigating how many FLOPs can be performed per CPU cycle using the GotoBLAS library. I ran a matrix multiplication with 32-bit floating-point numbers and, by hand calculation, got roughly 8 FLOPs per CPU cycle. My guess is that my processor (Intel Xeon E5430) has two FPUs, each of which executes one SSE instruction on 128-bit XMM registers per cycle. With 32-bit floats, each register holds 4 values, so that gives 2*4 = 8 FLOPs per CPU cycle.
Is my guess correct? Is there an official manual I can refer to for the number of FPUs in an Intel processor?
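For reference, the hand calculation can be sketched roughly as below. This is only an illustration: the matrix size and elapsed time are placeholder values, not the actual measurements, the 2.66 GHz clock is assumed to be the nominal frequency of the E5430, and the helper names `gemm_flops` and `flops_per_cycle` are made up for this sketch. It also assumes the multiplication ran on a single core.

```python
def gemm_flops(m, n, k):
    """Count the floating-point operations in an (m x k) by (k x n) product.

    Each of the m*n output entries needs k multiplies and k adds,
    so the usual estimate is 2*m*n*k.
    """
    return 2 * m * n * k


def flops_per_cycle(m, n, k, seconds, clock_hz):
    """Divide the operation count by the number of elapsed clock cycles."""
    cycles = seconds * clock_hz
    return gemm_flops(m, n, k) / cycles


# Placeholder inputs (assumptions, not measured data):
n = 2000            # square matrices, n x n
seconds = 0.75      # hypothetical wall-clock time of the SGEMM call
clock_hz = 2.66e9   # assumed nominal clock of the Xeon E5430

print(flops_per_cycle(n, n, n, seconds, clock_hz))
```

If the library used several cores, the result would have to be divided by the number of cores to get a per-core figure.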
Thanks!
Upvotes: 2
Views: 988
Reputation: 63
I think I found the reason. In theory, the Intel Xeon E5430 can perform one 4-wide SSE addition and one 4-wide SSE multiplication together in the same CPU cycle for single-precision floating-point numbers, which gives 4 + 4 = 8 FLOPs per cycle.
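The arithmetic behind that peak figure can be written out as a small sketch. The function names are hypothetical, and the model is simplified: it assumes one SSE add and one SSE multiply can complete every cycle, as described above, and ignores everything else (loads, stores, latency).

```python
def simd_lanes(register_bits, element_bits):
    """Number of elements that fit in one SIMD register."""
    return register_bits // element_bits


def peak_flops_per_cycle(register_bits, element_bits, simd_ops_per_cycle):
    """Theoretical peak: lanes per register times SIMD arithmetic ops per cycle."""
    return simd_lanes(register_bits, element_bits) * simd_ops_per_cycle


# 128-bit XMM registers, one add plus one multiply per cycle (assumed model)
single = peak_flops_per_cycle(128, 32, 2)   # 4 lanes * 2 ops = 8
double = peak_flops_per_cycle(128, 64, 2)   # 2 lanes * 2 ops = 4

print(single, double)  # prints "8 4"
```

Under the same assumptions, double precision would top out at 4 FLOPs per cycle, since only two 64-bit values fit in an XMM register.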
Upvotes: 1