Reputation: 101
I want to clamp 32-bit unsigned ints to a fixed value (0x10000) using only SSE2 instructions.
Basically, this C code:
if (c>0x10000) c=0x10000;
The code below works, but I'm wondering if it can be simplified, given that the limit is a specific constant (0xFFFF+0x0001):
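movdqa  xmm3, xmm0   ; xmm0 contains 4 dword unsigned values
movdqa  xmm4, xmm5   ; xmm5: four dword 0x10000 values
pxor    xmm3, xmm5   ; xmm3 = x ^ 0x10000 (keeps the sign bit of x)
pcmpgtd xmm4, xmm0   ; xmm4 = (0x10000 > x), signed compare
psrad   xmm3, 31     ; xmm3 = all-ones where x >= 0x80000000
pxor    xmm4, xmm3   ; flip the compare for those lanes: unsigned x < 0x10000
pand    xmm0, xmm4   ; keep x where x < 0x10000
pandn   xmm4, xmm5   ; 0x10000 where x >= 0x10000
por     xmm0, xmm4   ; combine both halves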
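For readers who prefer intrinsics, the sequence above can be sketched in C roughly as follows. This is only an illustration of the same steps, not a tuned implementation; the function names (`clamp_u32_full`, `clamp_one`, `clamp_ref`, `clamp_selfcheck`) are made up for this example.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Clamp four unsigned 32-bit lanes to 0x10000 over the full input range,
   following the same steps as the assembly sequence. */
static __m128i clamp_u32_full(__m128i x)
{
    const __m128i limit = _mm_set1_epi32(0x10000);
    /* All-ones in lanes where x >= 0x80000000 (sign bit of x ^ limit). */
    __m128i sign = _mm_srai_epi32(_mm_xor_si128(x, limit), 31);
    /* Signed compare: limit > x. Wrong for lanes with the sign bit set. */
    __m128i lt = _mm_cmpgt_epi32(limit, x);
    /* Flip the result in those lanes to get the unsigned x < limit. */
    __m128i keep = _mm_xor_si128(lt, sign);
    /* Select x where keep is set, otherwise the limit. */
    return _mm_or_si128(_mm_and_si128(x, keep),
                        _mm_andnot_si128(keep, limit));
}

/* Hypothetical helper: clamp one value by broadcasting it to all lanes. */
static uint32_t clamp_one(uint32_t c)
{
    __m128i r = clamp_u32_full(_mm_set1_epi32((int)c));
    return (uint32_t)_mm_cvtsi128_si32(r);
}

/* Scalar reference from the question. */
static uint32_t clamp_ref(uint32_t c)
{
    return c > 0x10000u ? 0x10000u : c;
}

/* Compare the vector version against the scalar one on boundary values. */
static void clamp_selfcheck(void)
{
    const uint32_t samples[] = {
        0u, 1u, 0xFFFFu, 0x10000u, 0x10001u,
        0x7FFFFFFFu, 0x80000000u, 0xFFFFFFFFu
    };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(clamp_one(samples[i]) == clamp_ref(samples[i]));
}
```

Calling `clamp_selfcheck()` exercises the boundary cases, including inputs with the top bit set, which are the ones the extra `pxor`/`psrad` pair exists to handle.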
The value of c is in the range 0x00000000-0xFFFFFFFF, but code that assumes it is in the range 0x00000000-0x00FFFFFF or 0x00000000-0x00FF0000 may be acceptable.
Upvotes: 6
Views: 355
Reputation: 18827
Here is an SSE2 solution that works on the full range using saturated addition/subtraction. It requires 4 uops and 2 constants (and one copy):
(Edit: minor improvement to the previous version. Neither of the required constants gets destroyed.)
The right columns describe what happens if the high 16 bits of the input (x.h) are zero (in which case x.l needs to be returned) or not zero (in which case 0x10000 needs to be returned).
// assumes xmm1 contains 0xffffffff -- can be generated by pcmpeqd
// assumes xmm3 contains 0xfffe0000 -- could be generated by left-shifting a ffffffff vector
                        x.h==0        x.h!=0
paddusw xmm0, xmm3      [fffe,x.l]    [ffff,x.l]
movdqa  xmm2, xmm0
psrld   xmm2, 16        [0000,fffe]   [0000,ffff]
psubw   xmm2, xmm1      [0001,ffff]   [0001,0000]
pand    xmm0, xmm2      [0000,x.l]    [0001,0000]
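The same four steps can be written with intrinsics, which makes it easy to check the two columns of the table against real inputs. This is a sketch only; `clamp_u32_sat`, `clamp_sat_one` and `clamp_sat_selfcheck` are names invented for the example.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Full-range clamp to 0x10000 using one saturating 16-bit add. */
static __m128i clamp_u32_sat(__m128i x)
{
    const __m128i ones = _mm_set1_epi32(-1);                /* 0xffffffff */
    const __m128i hi   = _mm_set1_epi32((int)0xfffe0000u);  /* 0xfffe0000 */

    /* paddusw: high word becomes fffe if x.h == 0, else saturates to ffff.
       The low word has 0 added, so it stays x.l. */
    x = _mm_adds_epu16(x, hi);
    /* psrld: move the high word down: [0000,fffe] or [0000,ffff]. */
    __m128i m = _mm_srli_epi32(x, 16);
    /* psubw by ffff adds 1 to each word: [0001,ffff] or [0001,0000]. */
    m = _mm_sub_epi16(m, ones);
    /* pand: [0000,x.l] when x.h == 0, [0001,0000] otherwise. */
    return _mm_and_si128(x, m);
}

/* Hypothetical helper: clamp one value by broadcasting it to all lanes. */
static uint32_t clamp_sat_one(uint32_t c)
{
    __m128i r = clamp_u32_sat(_mm_set1_epi32((int)c));
    return (uint32_t)_mm_cvtsi128_si32(r);
}

/* Check both table columns, including the extremes of each case. */
static void clamp_sat_selfcheck(void)
{
    const uint32_t samples[] = {
        0u, 0x1234u, 0xFFFFu,              /* x.h == 0: expect x.l     */
        0x10000u, 0x1FFFFu, 0xFFFFFFFFu    /* x.h != 0: expect 0x10000 */
    };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        uint32_t want = samples[i] > 0x10000u ? 0x10000u : samples[i];
        assert(clamp_sat_one(samples[i]) == want);
    }
}
```

Here the compiler materializes both constants itself; in hand-written assembly they would be set up once outside the loop as described in the comments above.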
If you have SSE4.1, of course pminud is simpler and better. And if you don't need to cover the full input range of xmm0, the solution by fuz is more generic, easier, and more straightforward (it also has a slightly shorter dependency chain and requires just one constant vector).
Upvotes: 6
Reputation: 1185
assumes it is in the range 0x00000000-0x00FFFFFF
minps xmm0, xmm5
This works if you haven't set DAZ (Denormals Are Zero) in MXCSR. With DAZ set (bit 1<<6 = 0x40), minps treats 0x10000 as representing exactly 0.0, so the result is 0x00000000.
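As a rough sketch, the `minps` trick looks like this with intrinsics. It assumes inputs in the range 0x00000000-0x00FFFFFF and that DAZ is clear; the function asserts the latter. The names `clamp_u32_minps`, `lane` and `MXCSR_DAZ` are made up for this example.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2; also pulls in the SSE1 float intrinsics */
#include <stdint.h>

#define MXCSR_DAZ 0x40u  /* Denormals-Are-Zero bit (1 << 6) in MXCSR */

/* Clamp four lanes to 0x10000 by treating the bit patterns as floats.
   Only valid for inputs in 0x00000000-0x00FFFFFF with DAZ clear. */
static __m128i clamp_u32_minps(__m128i x)
{
    /* With DAZ set, 0x10000 would be treated as 0.0, so refuse to run. */
    assert((_mm_getcsr() & MXCSR_DAZ) == 0);

    const __m128 limit = _mm_castsi128_ps(_mm_set1_epi32(0x10000));
    /* The casts only reinterpret bits; no conversion is performed. */
    return _mm_castps_si128(_mm_min_ps(_mm_castsi128_ps(x), limit));
}

/* Hypothetical helper: read lane i (0 = lowest) of a vector. */
static uint32_t lane(__m128i v, int i)
{
    uint32_t out[4];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}
```

For example, `clamp_u32_minps(_mm_set_epi32(0x00FFFFFF, 0x10001, 0x10000, 0xFFFF))` should leave the lowest lane at 0xFFFF and set the other three to 0x10000.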
This is very slow on some of the CPUs where it would be useful (because of microcode assists for denormals), including first-gen Core 2 Duo (E6600), which has SSSE3 but not SSE4.1 for pminud. A test loop has a throughput of one minps per clock with normalized inputs, but with these subnormals it averages 119 cycles per minps. It's fast on Skylake even with subnormals.
Note that linking with gcc -ffast-math will include CRT startup code that sets FTZ and DAZ, so real programs can have it set without doing any x86-specific stuff. DAZ avoids minps slowdowns on CPUs like Core 2, but of course makes it useless for playing with small integers.
(FTZ doesn't affect minps; it doesn't have to round its output.)
This might have some extra bypass latency between SIMD-integer instructions (and minps itself has multi-cycle latency), but it is still better for throughput than SSE2 emulation of SSE4.1 pminsd / pminud on CPUs where it doesn't take a microcode assist due to subnormal inputs.
Integer values in this limited range are bit-patterns for finite non-negative floats (IEEE binary32). Larger integer bit-patterns represent larger-magnitude values, up to the first NaN (0x7F800001).
Half the values in this range have exponent field = 0 (bits 30:23), so are subnormal aka denormal floats. 0x00800000 is the bit-pattern for the smallest normalized float.
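The bit-pattern claims above can be checked with a few lines of portable C. This is a small illustration, assuming `float` is IEEE binary32 and that the program runs with DAZ clear; the helper names are invented for the example.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float without converting the value. */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Exponent field, bits 30:23 of the pattern. */
static uint32_t exponent_field(uint32_t bits)
{
    return (bits >> 23) & 0xFFu;
}

/* Subnormal: exponent field 0 with a nonzero mantissa. */
static int is_subnormal_bits(uint32_t bits)
{
    return exponent_field(bits) == 0 && (bits & 0x007FFFFFu) != 0;
}

static void bit_pattern_selfcheck(void)
{
    /* The clamp limit itself is a subnormal float. */
    assert(is_subnormal_bits(0x00010000u));
    /* 0x00800000 is the smallest normalized float. */
    assert(!is_subnormal_bits(0x00800000u));
    assert(bits_to_float(0x00800000u) == FLT_MIN);
    /* Larger patterns are larger floats within this range. */
    assert(bits_to_float(0x0000FFFFu) < bits_to_float(0x00010000u));
    assert(bits_to_float(0x00010000u) < bits_to_float(0x00FFFFFFu));
    /* 0x7F800001 is the first NaN pattern: it compares unequal to itself. */
    float first_nan = bits_to_float(0x7F800001u);
    assert(first_nan != first_nan);
}
```

Because the ordering of these patterns as integers matches their ordering as floats, a float minimum against 0x10000 gives the same result as an integer minimum within the stated range.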
Upvotes: 5
Reputation: 93127
If the range can be assumed to be 0x00000000 to 0x7fffffff or narrower, you can pretend the values are signed and simplify the sequence to:
; xmm0 contains 4 dword unsigned values (input)
; xmm5 contains [0x10000, 0x10000, 0x10000, 0x10000]
movdqa xmm1, xmm5
pcmpgtd xmm1, xmm0 ; input < 0x10000
pand xmm0, xmm1 ; input < 0x10000 ? input : 0
pandn xmm1, xmm5 ; input < 0x10000 ? 0 : 0x10000
por xmm0, xmm1 ; input < 0x10000 ? input : 0x10000
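An intrinsics version of this shorter sequence might look like the sketch below. The precondition (no lane has its sign bit set) is checked with an assertion; `clamp_u32_signed` and `lane_of` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Clamp four lanes to 0x10000, assuming every lane is <= 0x7fffffff
   so that a signed compare gives the unsigned answer. */
static __m128i clamp_u32_signed(__m128i x)
{
    const __m128i limit = _mm_set1_epi32(0x10000);

    /* Precondition: the sign bit is clear in all four lanes. */
    assert(_mm_movemask_ps(_mm_castsi128_ps(x)) == 0);

    __m128i keep = _mm_cmpgt_epi32(limit, x);       /* input < 0x10000 */
    return _mm_or_si128(_mm_and_si128(x, keep),     /* input where kept */
                        _mm_andnot_si128(keep, limit)); /* else 0x10000 */
}

/* Hypothetical helper: read lane i (0 = lowest) of a vector. */
static uint32_t lane_of(__m128i v, int i)
{
    uint32_t out[4];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}
```

With the input `_mm_set_epi32(0x7fffffff, 0x10001, 0x10000, 0xffff)`, the lowest lane should stay 0xffff and the other three should come out as 0x10000.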
With SSE4.1, you can further simplify the code to just:
pminud xmm0, xmm5 ; input < 0x10000 ? input : 0x10000
Upvotes: 5