Reputation: 3072
When should I use the streaming load (_mm_stream_load_si128) and when the regular SSE2 load (_mm_load_si128)? What is the performance trade-off?
Upvotes: 7
Views: 6679
Reputation: 435
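The streaming load intrinsic (_mm_stream_load_si128) performs the load "using a non-temporal memory hint" (according to the Intel Intrinsics Guide). The hint tells the processor that the data is unlikely to be reused soon, so the load should avoid evicting other data from the cache. It is only a hint: the processor is free to ignore it and perform an ordinary cached load, and Intel documents it as taking effect mainly for write-combining (WC) memory.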
This is useful if you are pulling in a lot of data that you are going to operate on immediately and not look at again for a "long" time. Most commonly this happens during streaming operations. I have used it when performing a simple operation on a large data set, where I know the data would quickly get evicted from the cache anyway. Operations such as memcpy also fall under this category.
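To make the streaming case concrete, here is a minimal sketch of a single pass over a large buffer. The helper name sum_u32_streaming and the choice of summing 32-bit integers are made up for illustration; nothing here comes from the Intel guide beyond the intrinsics themselves. It assumes GCC or Clang on x86 (the target attribute lets it build without -msse4.1) and a 16-byte-aligned buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1: _mm_stream_load_si128 */

/* Hypothetical helper: sums n_vec 16-byte blocks of 32-bit integers,
 * reading each block exactly once with a non-temporal load hint.
 * Assumes GCC/Clang; the attribute enables SSE4.1 for this function only. */
__attribute__((target("sse4.1")))
static uint32_t sum_u32_streaming(const void *data, size_t n_vec)
{
    /* The streaming load requires a 16-byte-aligned address. */
    assert(((uintptr_t)data & 15u) == 0);

    /* Some compilers declare the parameter as a non-const __m128i *,
     * so drop const here; the data is only read. */
    __m128i *p = (__m128i *)data;
    __m128i acc = _mm_setzero_si128();

    for (size_t i = 0; i < n_vec; ++i) {
        __m128i v = _mm_stream_load_si128(p + i);  /* non-temporal hint */
        acc = _mm_add_epi32(acc, v);               /* lane-wise 32-bit add */
    }

    /* Horizontal sum of the four 32-bit lanes. */
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Whether this is actually faster than the plain load depends on the CPU and the memory type, so measure before committing to it.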
The non-streaming load (_mm_load_si128) retrieves the value subject to normal caching rules. It may evict older cache entries if needed, and the value can then be served from the cache until it is itself evicted.
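For contrast, a rough sketch of a case where the regular load is the natural choice: a small buffer that is read several times in a row, so later passes can be served from the cache. The helper name sum_u32_repeated and the passes parameter are hypothetical, chosen only to illustrate reuse; the buffer is assumed to be 16-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_load_si128 */

/* Hypothetical helper: walks the same n_vec 16-byte blocks `passes` times
 * and accumulates the 32-bit lanes. Because the data is revisited, the
 * ordinary cached load is used. */
static uint32_t sum_u32_repeated(const void *data, size_t n_vec, unsigned passes)
{
    /* _mm_load_si128 also requires a 16-byte-aligned address. */
    assert(((uintptr_t)data & 15u) == 0);

    const __m128i *p = (const __m128i *)data;
    __m128i acc = _mm_setzero_si128();

    for (unsigned pass = 0; pass < passes; ++pass) {
        for (size_t i = 0; i < n_vec; ++i) {
            /* Normal load: the line stays cacheable for the next pass. */
            acc = _mm_add_epi32(acc, _mm_load_si128(p + i));
        }
    }

    /* Horizontal sum of the four 32-bit lanes. */
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```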
If you expect to use the data again before a normal cache eviction would occur, then the non-streaming load is preferred. If you are operating on a large data set where a given piece of data is not expected to be accessed again before it would have been kicked out of the cache, the streaming load is preferred.
Upvotes: 7