grunge fightr

Reputation: 1370

ZeroMemory in SSE

I need a simple ZeroMemory implementation using SSE (SSE2 preferred). Can someone help with that? I searched through SO and the net but did not find a direct answer.

Upvotes: 2

Views: 2251

Answers (3)

doug65536

Reputation: 6781

A large share of the transistors in your CPU (caches, prefetchers, store buffers) exists to make memory access as fast as possible. The CPU already does an amazing job with memory accesses, and instructions execute drastically faster than main memory can be accessed.

Therefore, trying to beat memset is a mostly futile exercise in most cases because it is already limited by the speed of your memory (as mentioned by others).

Upvotes: 0

GJ.

Reputation: 10937

If you want to speed up your code, then you must understand exactly how your CPU works and where the bottleneck is.

Here is my speed-optimized routine, just to show how it should be done.

On my PC it is about 5 times faster (clearing a 1 MB memory block) than yours. Test it and ask if something isn't clear:

//edx = memory pointer, must be 16-byte aligned
//ecx = byte count, must be a multiple of 16
    xorps       xmm0, xmm0                      //Clear xmm0
    mov         eax, ecx                        //Save ecx to eax
    and         ecx, 0FFFFFF80h                 //Round count down to whole 128-byte blocks
    jz          @ClearRest                      //Less than 128 bytes to clear
@Aligned128BMove:
    movdqa      [edx], xmm0                     //Clear first 16 bytes of 128 bytes 
    movdqa      [edx + 10h], xmm0               //Clear second 16 bytes of 128 bytes 
    movdqa      [edx + 20h], xmm0               //...
    movdqa      [edx + 30h], xmm0
    movdqa      [edx + 40h], xmm0
    movdqa      [edx + 50h], xmm0
    movdqa      [edx + 60h], xmm0
    movdqa      [edx + 70h], xmm0
    add         edx, 128                        //inc mem pointer
    sub         ecx, 128                        //dec counter
    jnz         @Aligned128BMove
@ClearRest:
    and         eax, 07Fh                       //Remaining byte count (count mod 128)
    jz          @Exit
@LoopRest:
    movdqa      [edx], xmm0
    add         edx, 16
    sub         eax, 16
    jnz         @LoopRest
@Exit:
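
For anyone who would rather not write inline assembly, here is a rough C equivalent of the routine above using SSE2 intrinsics. This is only a sketch: the function name `zero_aligned_unrolled` is made up for illustration, and it assumes the same preconditions as the assembly (pointer 16-byte aligned, count a multiple of 16). `_mm_store_si128` is the intrinsic that corresponds to movdqa.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper (name is an assumption, not from the answer).
   Preconditions: dst is 16-byte aligned, len is a multiple of 16. */
static void zero_aligned_unrolled(void *dst, size_t len)
{
    __m128i *p = (__m128i *)dst;
    const __m128i zero = _mm_setzero_si128();
    size_t blocks = len / 128;        /* whole 128-byte blocks */
    size_t rest = (len % 128) / 16;   /* leftover 16-byte chunks */

    /* Main loop: eight aligned 16-byte stores per iteration. */
    while (blocks--) {
        _mm_store_si128(p + 0, zero);
        _mm_store_si128(p + 1, zero);
        _mm_store_si128(p + 2, zero);
        _mm_store_si128(p + 3, zero);
        _mm_store_si128(p + 4, zero);
        _mm_store_si128(p + 5, zero);
        _mm_store_si128(p + 6, zero);
        _mm_store_si128(p + 7, zero);
        p += 8;
    }

    /* Tail loop: one aligned 16-byte store at a time. */
    while (rest--) {
        _mm_store_si128(p++, zero);
    }
}
```

Whether this beats memset on a given machine is something to measure, not assume.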

Upvotes: 1

tc.

Reputation: 33592

Is ZeroMemory() or memset() not good enough?

Disclaimer: Some of the following may be SSE3.

  1. Fill any unaligned leading bytes by looping until the address is a multiple of 16
  2. Save an xmm reg (there is no push for xmm registers, so spill it to the stack with movdqu)
  3. pxor to zero the xmm reg
  4. While the remaining length >= 16,
    1. movdqa or movntdq to do the write
  5. Restore the xmm reg from the stack.
  6. Fill any unaligned trailing bytes.
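
The steps above can be sketched in C with SSE2 intrinsics roughly as follows. The function name `zero_memory_sse2` is hypothetical, and this is a minimal illustration, not a tuned implementation. Steps 2 and 5 (saving and restoring the xmm register) do not appear because the compiler handles register allocation when intrinsics are used. Everything here comes from `emmintrin.h`, which is the SSE2 header.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name is an assumption): zero len bytes at dst.
   No alignment requirement on dst or len. */
static void zero_memory_sse2(void *dst, size_t len)
{
    unsigned char *p = (unsigned char *)dst;

    /* Step 1: byte stores until p is a multiple of 16 (or len runs out). */
    while (len > 0 && ((uintptr_t)p & 15u) != 0) {
        *p++ = 0;
        --len;
    }

    /* Step 3: zero an xmm register (pxor xmm, xmm). */
    const __m128i zero = _mm_setzero_si128();

    /* Step 4: aligned 16-byte stores (movdqa) while at least 16 bytes remain. */
    while (len >= 16) {
        _mm_store_si128((__m128i *)p, zero);
        p += 16;
        len -= 16;
    }

    /* Step 6: trailing bytes that do not fill a whole 16-byte chunk. */
    while (len > 0) {
        *p++ = 0;
        --len;
    }
}
```

Usage would look like `zero_memory_sse2(buf, sizeof buf);` for any buffer, aligned or not.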

movntdq may appear to be faster because it tells the processor to not bring the data into your cache, but this can cause a performance penalty later if the data is going to be used. It may be more appropriate if you are scrubbing memory before freeing it (like you might do with SecureZeroMemory()).
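
A minimal sketch of the movntdq variant, again in C with SSE2 intrinsics. The name `zero_aligned_stream` is made up, and the sketch assumes the pointer is 16-byte aligned and the length is a multiple of 16 (leading and trailing bytes would be handled as in the steps above). `_mm_stream_si128` corresponds to movntdq. Non-temporal stores are weakly ordered, so the sketch ends with `_mm_sfence()` to order them before later stores.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, also provides _mm_sfence */
#include <stddef.h>

/* Hypothetical helper (name is an assumption).
   Preconditions: dst is 16-byte aligned, len is a multiple of 16. */
static void zero_aligned_stream(void *dst, size_t len)
{
    __m128i *p = (__m128i *)dst;
    const __m128i zero = _mm_setzero_si128();
    size_t chunks = len / 16;

    /* Non-temporal 16-byte stores (movntdq): write around the cache. */
    for (size_t i = 0; i < chunks; ++i) {
        _mm_stream_si128(p + i, zero);
    }

    /* Make the streamed stores ordered with respect to later stores. */
    _mm_sfence();
}
```

Note that this sketch makes no attempt to provide the guarantee SecureZeroMemory() gives about the compiler not optimizing the writes away.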

Upvotes: 5
