Reputation: 52646
Problem:
I converted some MMX code to the corresponding SSE2 code, and I expected a 1.5x-2x speedup. But both take exactly the same time. Why is that?
Scenario:
I am learning the SIMD instruction sets and comparing their performance. I took an array operation Z = X^2 + Y^2,
where X and Y are large one-dimensional arrays of type char. The values of X and Y are restricted to be less than 10, so that Z is always < 255 (1 byte) and there is no overflow to worry about.
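For concreteness, a minimal scalar sketch of the operation looks like this (the function name and the use of unsigned char are illustrative, not my exact code):

    #include <cstddef>

    // Z[i] = X[i]^2 + Y[i]^2. Inputs are < 10, so each result fits in one
    // unsigned byte (the maximum is 9*9 + 9*9 = 162).
    void square_sum_scalar(const unsigned char* X, const unsigned char* Y,
                           unsigned char* Z, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            Z[i] = static_cast<unsigned char>(X[i] * X[i] + Y[i] * Y[i]);
    }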
I wrote the C++ code first and measured its time. Then I wrote corresponding assembly code (~3x speedup over C++). Then I wrote an MMX version (~12x vs. C++). Then I converted the MMX code into SSE2 code, and it runs at exactly the same speed as the MMX code. Theoretically, I expected SSE2 to be ~2x faster than MMX, since XMM registers process 16 bytes per instruction instead of 8.
For the conversion from MMX to SSE2, I changed all the MM registers to XMM registers, changed a couple of data-movement instructions, and so on.
My MMX and SSE codes are pasted here : https://gist.github.com/abidrahmank/5281486 (I don't want to paste them all here)
These functions are later called from a main.cpp file, where the arrays are passed as arguments.
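In intrinsics form, the SSE2 inner loop is roughly equivalent to the following (a sketch of the same idea, not the exact assembly from the gist; it assumes n is a multiple of 16 and that the buffers are 16-byte aligned):

    #include <emmintrin.h>  // SSE2 intrinsics
    #include <cstddef>

    void square_sum_sse2(const unsigned char* X, const unsigned char* Y,
                         unsigned char* Z, std::size_t n)
    {
        const __m128i zero = _mm_setzero_si128();
        for (std::size_t i = 0; i < n; i += 16) {
            __m128i x = _mm_load_si128(reinterpret_cast<const __m128i*>(X + i));
            __m128i y = _mm_load_si128(reinterpret_cast<const __m128i*>(Y + i));
            // Zero-extend bytes to 16-bit lanes; values are < 10, so the
            // squares (< 100) and their sums (<= 162) never overflow a word.
            __m128i xlo = _mm_unpacklo_epi8(x, zero), xhi = _mm_unpackhi_epi8(x, zero);
            __m128i ylo = _mm_unpacklo_epi8(y, zero), yhi = _mm_unpackhi_epi8(y, zero);
            __m128i lo = _mm_add_epi16(_mm_mullo_epi16(xlo, xlo),
                                       _mm_mullo_epi16(ylo, ylo));
            __m128i hi = _mm_add_epi16(_mm_mullo_epi16(xhi, xhi),
                                       _mm_mullo_epi16(yhi, yhi));
            // Pack back to bytes; saturation never triggers since results <= 162.
            _mm_store_si128(reinterpret_cast<__m128i*>(Z + i),
                            _mm_packus_epi16(lo, hi));
        }
    }

The MMX version does the same thing with 8 bytes per iteration in MM registers, so on paper the SSE2 loop moves twice as much data per instruction.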
What I have done:
1 - I went through some optimization manuals from Intel and other websites. The main pitfall with SSE2 code is 16-byte memory alignment. When I manually checked the addresses, they were all found to be 16-byte aligned (the kind of check I used is sketched after this list). I also tried both MOVDQU and MOVDQA, but both give the same result and no speedup compared to MMX.
2 - I stepped through in debug mode and checked the register values as the instructions executed. They execute exactly as I expected, i.e. 16 bytes are read in and the resulting 16 bytes are written out.
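For reference, this is the kind of alignment check I mean (a small sketch; the helper name is my own):

    #include <cstdint>

    // True if p is 16-byte aligned, i.e. safe for MOVDQA / _mm_load_si128.
    bool is_aligned16(const void* p)
    {
        return (reinterpret_cast<std::uintptr_t>(p) & 0xF) == 0;
    }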
Resources:
I am using an Intel Core i5 processor with Windows 7 and Visual C++ 2010.
Question:
So the final question is: why is there no performance improvement for the SSE2 code compared to the MMX code? Am I doing something wrong in the SSE code, or is there another explanation?
Upvotes: 3
Views: 3867
Reputation: 106167
Harold's comment was absolutely correct: the arrays you are processing do not fit into cache on your machine, so your computation is entirely load/store bound.
I timed the throughput of your computation on a current-generation i7 for various buffer lengths, as well as the throughput of the same routine with everything except the loads and stores removed.
What that comparison shows is that once the buffer gets so big that it no longer fits in the L3 cache, the throughput of your computation exactly matches the achieved load/store bandwidth. This tells us that how you process the data makes essentially no difference (unless you make it significantly slower); the speed of the computation is limited by the processor's ability to move data to and from memory.
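A load/store-only variant of the kernel looks roughly like this (a sketch of the idea, not the exact routine I timed):

    #include <emmintrin.h>
    #include <cstddef>

    // Same memory traffic as the real kernel, with the arithmetic removed.
    // If the full kernel runs no faster than this loop, it is load/store bound.
    void loads_stores_only(const unsigned char* X, const unsigned char* Y,
                           unsigned char* Z, std::size_t n)
    {
        for (std::size_t i = 0; i < n; i += 16) {
            __m128i x = _mm_load_si128(reinterpret_cast<const __m128i*>(X + i));
            __m128i y = _mm_load_si128(reinterpret_cast<const __m128i*>(Y + i));
            // OR the inputs so neither load is optimized away; this op is
            // a placeholder, not part of the real computation.
            _mm_store_si128(reinterpret_cast<__m128i*>(Z + i),
                            _mm_or_si128(x, y));
        }
    }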
If you do your timing on smaller arrays (small enough to stay resident in cache), you will see a difference between your two implementations.
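For example, something along these lines (a sketch using C++11 timing; the buffer size and repetition count are arbitrary choices of mine, and on VS2010 you would likely use QueryPerformanceCounter instead of <chrono>):

    #include <chrono>
    #include <cstdio>
    #include <cstddef>

    // Declared here; substitute the SSE2 sketch above or the MMX routine.
    void square_sum_sse2(const unsigned char*, const unsigned char*,
                         unsigned char*, std::size_t);

    int main()
    {
        // 16 KiB per array fits comfortably in a 32 KiB L1 data cache.
        const std::size_t n = 16 * 1024;
        alignas(16) static unsigned char X[n], Y[n], Z[n];
        for (std::size_t i = 0; i < n; ++i) { X[i] = 3; Y[i] = 4; }

        const int reps = 100000;
        auto t0 = std::chrono::high_resolution_clock::now();
        for (int r = 0; r < reps; ++r)
            square_sum_sse2(X, Y, Z, n);
        auto t1 = std::chrono::high_resolution_clock::now();

        // Two loads and one store per element -> 3*n bytes per repetition.
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        std::printf("%.2f bytes/ns\n", 3.0 * n * reps / ns);
        return 0;
    }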
Upvotes: 4