Reputation: 101

Clamp unsigned int to 0x10000 using SSE2

I want to clamp 32-bit unsigned ints to a fixed value (0x10000) using only SSE2 instructions.

Basically, this C code: if (c>0x10000) c=0x10000;

The code below works, but I'm wondering whether it can be simplified, given that the limit is a specific constant (0xFFFF+0x0001):

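movdqa    xmm3, xmm0  ; xmm0 contains 4 dword unsigned values
movdqa    xmm4, xmm5  ; xmm5: four dword 0x10000 values
pxor      xmm3, xmm5  ; bit 31 of xmm3 = bit 31 of input (0x10000 has bit 31 clear)
pcmpgtd   xmm4, xmm0  ; signed compare: 0x10000 > input
psrad     xmm3, 31    ; all-ones where input >= 0x80000000
pxor      xmm4, xmm3  ; unsigned compare: input < 0x10000
pand      xmm0, xmm4  ; input < 0x10000 ? input :       0
pandn     xmm4, xmm5  ; input < 0x10000 ?     0 : 0x10000
por       xmm0, xmm4  ; input < 0x10000 ? input : 0x10000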

The value of c is in the range 0x00000000-0xFFFFFFFF, but code that assumes it is in the range 0x00000000-0x00FFFFFF or 0x00000000-0x00FF0000 may be acceptable.
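For readers who prefer intrinsics, the sequence above can be sketched in C roughly as follows. This is only an illustration, not code from the question: the function names (clamp_ref, clamp_full_range, clamp_full_range_u32) are made up here, and the single-lane helper exists purely to make spot checks easy.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Scalar reference: the C one-liner from the question. */
static uint32_t clamp_ref(uint32_t c)
{
    return c > 0x10000u ? 0x10000u : c;
}

/* Full-range unsigned clamp, mirroring the nine-instruction sequence. */
static __m128i clamp_full_range(__m128i x)
{
    const __m128i limit = _mm_set1_epi32(0x10000);
    /* all-ones in lanes where x >= 0x80000000 (bit 31 set) */
    __m128i sign = _mm_srai_epi32(_mm_xor_si128(x, limit), 31);
    /* signed compare: limit > x (wrongly true for x >= 0x80000000) */
    __m128i lt = _mm_cmpgt_epi32(limit, x);
    /* flipping those lanes gives the unsigned compare x < limit */
    __m128i keep = _mm_xor_si128(lt, sign);
    /* keep ? x : limit */
    return _mm_or_si128(_mm_and_si128(x, keep),
                        _mm_andnot_si128(keep, limit));
}

/* Hypothetical helper: run one value through lane 0. */
static uint32_t clamp_full_range_u32(uint32_t c)
{
    __m128i r = clamp_full_range(_mm_set1_epi32((int)c));
    return (uint32_t)_mm_cvtsi128_si32(r);
}
```

Comparing clamp_full_range_u32 against clamp_ref on edge values such as 0xFFFF, 0x10000, 0x10001, 0x7FFFFFFF and 0x80000000 is a quick way to sanity-check any shorter replacement.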

Upvotes: 6

Answers (3)

chtz

Reputation: 18827

Here is an SSE2 solution that works on the full range using saturated addition/subtraction. It requires 4 uops and 2 constants (and one copy):

(Edit: minor improvement to the previous version. Neither of the required constants gets destroyed.)

The right-hand columns describe what happens if the high 16 bits of the input (x.h) are zero (in which case x.l needs to be returned) or not zero (in which case 0x10000 needs to be returned).

// assumes xmm1 contains 0xffffffff -- can be generated by pcmpeqd
// assumes xmm3 contains 0xfffe0000 -- could be generated by left-shifting a ffffffff vector

                           x.h==0      x.h!=0
    paddusw xmm0, xmm3     [fffe,x.l]  [ffff,x.l]
    movdqa  xmm2, xmm0
    psrld   xmm2, 16       [0000,fffe] [0000,ffff]
    psubw   xmm2, xmm1     [0001,ffff] [0001,0000]
    pand    xmm0, xmm2     [0000,x.l]  [0001,0000]

If you have SSE4.1, pminud is of course simpler and better. And if you don't need to cover the full input range of xmm0, the solution by fuz is more generic, easier and more straightforward (it also has a slightly shorter dependency chain and requires just one constant vector).
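The saturating-add sequence can also be written with intrinsics. The sketch below is one possible C rendering, assuming the same two constants as the assembly; the names clamp_sat16 and clamp_sat16_u32 are invented for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Clamp four unsigned dwords to 0x10000 over the full 32-bit range. */
static __m128i clamp_sat16(__m128i x)
{
    const __m128i all_ones = _mm_set1_epi32(-1);               /* 0xffffffff */
    const __m128i bias     = _mm_set1_epi32((int)0xfffe0000u); /* [fffe,0000] */

    /* paddusw: high word becomes fffe if x.h == 0, else saturates to ffff */
    __m128i t = _mm_adds_epu16(x, bias);
    /* psrld: move that high word down -> [0000,fffe] or [0000,ffff] */
    __m128i m = _mm_srli_epi32(t, 16);
    /* psubw: subtract -1 per word -> [0001,ffff] or [0001,0000] */
    m = _mm_sub_epi16(m, all_ones);
    /* pand: [0000,x.l] if x.h == 0, else [0001,0000] */
    return _mm_and_si128(t, m);
}

/* Hypothetical helper: run one value through lane 0. */
static uint32_t clamp_sat16_u32(uint32_t c)
{
    __m128i r = clamp_sat16(_mm_set1_epi32((int)c));
    return (uint32_t)_mm_cvtsi128_si32(r);
}
```

As in the assembly, neither constant is modified, so both can be hoisted out of a loop.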

Upvotes: 6

aqrit

Reputation: 1185

Quoting the question: "assumes it is in the range 0x00000000-0x00FFFFFF"

minps     xmm0, xmm5

This works if you haven't set DAZ (Denormals Are Zero) in MXCSR. With DAZ set (bit 1<<6 = 0x40), minps treats 0x10000 as representing exactly 0.0, so the result is 0x00000000.

This is very slow on some of the CPUs where it would be useful (because of microcode assists for denormals), including first-gen Core 2 Duo (E6600), which has SSSE3 but not SSE4.1 for pminud. A test loop has a throughput of 1 minps per clock with normalized inputs, but with these subnormals it averages 119 cycles per minps. It's fast on Skylake even with subnormals.

Note that linking with gcc -ffast-math will include CRT startup code that sets FTZ and DAZ, so real programs can have it set without doing any x86-specific stuff. DAZ avoids minps slowdowns on CPUs like Core 2, but of course makes it non-useful for playing with small integers.

(FTZ doesn't affect minps; it doesn't have to round its output.)

This might have some extra bypass latency between SIMD-integer instructions (and minps itself has multi-cycle latency), but it is still better for throughput than SSE2 emulation of SSE4.1 pminsd / pminud on CPUs where it doesn't take a microcode assist due to subnormal inputs.

Integer values in this limited range are bit-patterns for finite non-negative floats (IEEE binary32). Larger integer bit-patterns represent larger-magnitude values, up to the first NaN (0x7F800001).

Half the values in this range have exponent field = 0 (bits 30:23), so are subnormal aka denormal floats. 0x00800000 is the bit-pattern for the smallest normalized float.
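A small intrinsics sketch of the minps trick, under the assumptions stated above (inputs no larger than 0x00FFFFFF and DAZ clear in MXCSR). The function names are made up for illustration; daz_enabled just reads bit 6 (0x40) of MXCSR via _mm_getcsr.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE ones */
#include <stdint.h>

/* Nonzero if DAZ (bit 6, 0x40) is set in MXCSR; the trick breaks if so. */
static int daz_enabled(void)
{
    return (_mm_getcsr() & 0x40u) != 0;
}

/* Clamp by treating the integer bit patterns as non-negative floats.
   Assumes every lane is in 0x00000000-0x00FFFFFF and DAZ is clear. */
static __m128i clamp_minps(__m128i x)
{
    const __m128 limit = _mm_castsi128_ps(_mm_set1_epi32(0x10000));
    return _mm_castps_si128(_mm_min_ps(_mm_castsi128_ps(x), limit));
}

/* Hypothetical helper: run one value through lane 0. */
static uint32_t clamp_minps_u32(uint32_t c)
{
    __m128i r = clamp_minps(_mm_set1_epi32((int)c));
    return (uint32_t)_mm_cvtsi128_si32(r);
}
```

The casts are free (they only change the type the compiler sees), so the function should compile down to the single minps. Calling daz_enabled() once at startup is a cheap guard against a build that linked in -ffast-math startup code.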

Upvotes: 5

fuz

Reputation: 93127

If the range can be assumed to be 0x00000000 to 0x7fffffff or narrower, you can pretend the values are signed and simplify the sequence to:

; xmm0 contains 4 dword unsigned values (input)
; xmm5 contains [0x10000, 0x10000, 0x10000, 0x10000]
movdqa    xmm1, xmm5
pcmpgtd   xmm1, xmm0  ; input < 0x10000
pand      xmm0, xmm1  ; input < 0x10000 ? input :       0
pandn     xmm1, xmm5  ; input < 0x10000 ?     0 : 0x10000
por       xmm0, xmm1  ; input < 0x10000 ? input : 0x10000

With SSE4.1, you can further simplify the code to just

pminud    xmm0, xmm5  ; input < 0x10000 ? input : 0x10000
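For completeness, here is how the SSE2 compare-and-blend sequence might look with intrinsics. This is a sketch under the same assumption (inputs no larger than 0x7fffffff); the names clamp_signed_range and clamp_signed_range_u32 are invented for the example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Clamp to 0x10000, valid for inputs in 0x00000000-0x7fffffff. */
static __m128i clamp_signed_range(__m128i x)
{
    const __m128i limit = _mm_set1_epi32(0x10000);
    __m128i lt = _mm_cmpgt_epi32(limit, x);          /* input < 0x10000 (signed) */
    return _mm_or_si128(_mm_and_si128(x, lt),        /* lt ? input : 0           */
                        _mm_andnot_si128(lt, limit)); /* lt ? 0 : 0x10000         */
}

/* Hypothetical helper: run one value through lane 0. */
static uint32_t clamp_signed_range_u32(uint32_t c)
{
    __m128i r = clamp_signed_range(_mm_set1_epi32((int)c));
    return (uint32_t)_mm_cvtsi128_si32(r);
}
```

Inputs at or above 0x80000000 compare as negative, so they pass through unchanged; that is exactly the range restriction stated at the top of this answer. With SSE4.1 available, the whole function body reduces to _mm_min_epu32(x, limit) from smmintrin.h.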

Upvotes: 5
