I would like to understand the potential gain of using Streaming SIMD Extensions (SSE) for bitwise operations between integers in the following minimal example in C.
Suppose that one executes a bitwise operation between two integers a and b (1), e.g., a ^ b. Alternatively, one can pack several such integers into 128-bit registers A and B and execute the same bitwise operation between them (2) with SSE. I would like to know whether executing (1) takes the same time as (2).
For example, one may run a timing experiment, measuring the time to perform N >> 1 bitwise operations of type (1) and the time to perform the same number of operations of type (2). Are these times roughly the same? If not, what would their ratio be on a specific machine? And how does the answer change for 256-bit or wider vector extensions?
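For concreteness, here is a minimal sketch of the timing experiment I have in mind (the array size, the unaligned loads, and the function names are just illustrative choices; it assumes SSE2, which is baseline on x86-64):

#include <emmintrin.h>   /* SSE2: _mm_xor_si128 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define N (1 << 22)      /* number of 64-bit elements; illustrative */

static uint64_t a[N], b[N], out[N];

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    /* (1): scalar XOR, one 64-bit element per operation */
    double t0 = seconds();
    for (size_t i = 0; i < N; i++)
        out[i] = a[i] ^ b[i];
    double t1 = seconds();

    /* (2): SSE2 XOR, two 64-bit elements per 128-bit instruction */
    for (size_t i = 0; i < N; i += 2) {
        __m128i A = _mm_loadu_si128((const __m128i *)&a[i]);
        __m128i B = _mm_loadu_si128((const __m128i *)&b[i]);
        _mm_storeu_si128((__m128i *)&out[i], _mm_xor_si128(A, B));
    }
    double t2 = seconds();

    printf("(1) scalar: %.3f s   (2) SSE: %.3f s\n", t1 - t0, t2 - t1);
    return 0;
}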
Upvotes: 1
Views: 1047
Reputation: 365257
Are you asking about this as part of a compiled C function? Compilers can easily auto-vectorize loops over arrays with AVX2 vpxor or AVX1 vxorps, so how a ^ operator compiles depends on the surrounding context.
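For instance, a plain loop like this sketch (the function name is illustrative) is typically auto-vectorized by gcc/clang at -O3: with -mavx2 the body compiles to vpxor ymm, with baseline SSE2 to pxor xmm:

#include <stddef.h>
#include <stdint.h>

/* restrict tells the compiler the arrays don't overlap, so it can
   vectorize without emitting runtime aliasing checks. */
void xor_arrays(uint64_t *restrict out,
                const uint64_t *restrict a,
                const uint64_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] ^ b[i];
}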
Obviously you have to compile with optimization enabled for any benchmark to be meaningful.
As far as what the hardware can do at an asm level, compiler-generated or hand-written doesn't matter; using intrinsics is a convenient way to get compilers to emit SIMD instructions.
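For example, the same loop hand-written with AVX2 intrinsics (a sketch; assumes n is a multiple of 4) compiles essentially straight to vpxor ymm:

#include <immintrin.h>   /* AVX2: _mm256_xor_si256 */
#include <stddef.h>
#include <stdint.h>

/* Hand-written equivalent of the auto-vectorized loop above.
   Compile with -mavx2; assumes n is a multiple of 4 (four 64-bit
   lanes per 256-bit vector). */
void xor_arrays_avx2(uint64_t *out, const uint64_t *a,
                     const uint64_t *b, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(out + i), _mm256_xor_si256(va, vb));
    }
}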
Let's take Intel Haswell as an example. With no memory bottlenecks, just operating on local variables in registers, with AVX2 you can get 3x vpxor ymm per clock (plus one other non-SIMD uop), so that's 3x 256 bits of XOR. (128-bit SSE2 pxor xmm has the same throughput as 256-bit AVX2 vpxor on Intel CPUs, so wider vectors are a pure win for throughput.)
Or with purely scalar code you can do 4x scalar 8/16/32/64-bit xor per clock on Haswell, if you have no other instructions. Both vpxor and xor are single-uop instructions with 1-cycle latency.
On AMD Bulldozer-family and earlier, pxor / vpxor has 2-cycle latency but 2-per-clock throughput, so the performance difference between a latency bottleneck and a throughput bottleneck is a factor of 4.
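To make the latency-vs-throughput distinction concrete, here is a sketch (GNU C; the empty inline-asm statements are barriers to stop the compiler from folding the repeated XOR, and the function names are illustrative). One serial dependency chain is bound by vpxor latency, while four independent chains can use the available execution units every cycle:

#include <immintrin.h>

/* Latency-bound: every XOR depends on the previous result, so this runs
   at 1 XOR per cycle on Haswell (1c latency) and 1 per 2 cycles on
   Bulldozer (2c latency), no matter how many XOR units exist. */
__m256i xor_one_chain(__m256i x, __m256i k, long iters)
{
    for (long i = 0; i < iters; i++) {
        x = _mm256_xor_si256(x, k);
        asm volatile("" : "+x"(x));   /* barrier: keep the loop honest */
    }
    return x;
}

/* Throughput-bound: four independent chains let the out-of-order core
   run several XORs per cycle (up to 3/clock on Haswell, 2/clock on
   Bulldozer). */
__m256i xor_four_chains(__m256i x0, __m256i x1, __m256i x2, __m256i x3,
                        __m256i k, long iters)
{
    for (long i = 0; i < iters; i++) {
        x0 = _mm256_xor_si256(x0, k);
        x1 = _mm256_xor_si256(x1, k);
        x2 = _mm256_xor_si256(x2, k);
        x3 = _mm256_xor_si256(x3, k);
        asm volatile("" : "+x"(x0), "+x"(x1), "+x"(x2), "+x"(x3));
    }
    return _mm256_xor_si256(_mm256_xor_si256(x0, x1),
                            _mm256_xor_si256(x2, x3));
}

Timing these two loops over the same iteration count would show the factor-of-4 gap described above on Bulldozer-family, and a factor of 3 on Haswell.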
CPU performance at such small scales is not one-dimensional: superscalar, pipelined, out-of-order CPUs make the question you're asking ("does one take longer?") too simplistic. See my answer on What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?, specifically the "There are three major dimensions to analyze for a short block" section.
See https://agner.org/optimize/ and other performance links in the x86 tag wiki.
Upvotes: 4