Mr.Grey

Reputation: 335

What's the difference between SIMD and SSE?

I'm confused: what's the difference between SIMD and SSE, SSE2, SSE3, AVX, etc.?

According to my knowledge and research, SIMD is an architecture which allows a single instruction to operate on multiple data elements, and SSE, AVX, etc. are instruction sets which implement that SIMD architecture.

Also, each extension has a different vector size: SSE has 128-bit registers and AVX has 256-bit registers. If the underlying SIMD architecture is the same (I think), then how do different instruction sets end up with different vector sizes?

I'm not sure if this is true, can someone explain to me in detail what actually happens?

Upvotes: 9

Views: 7795

Answers (2)

Peter Cordes

Reputation: 364068

SIMD = Single Instruction, Multiple Data. It's a concept in CPU architecture.

Many ISAs have SIMD extensions, like PowerPC's AltiVec, ARM's NEON / AArch64's ASIMD, etc.

SSE is an instruction-set extension for x86. (And baseline for x86-64, along with SSE2).

SSE1 and SSE2 provide a bunch of SIMD load/store and computation instructions (128-bit vector width) for float (SSE1), double, and 8 to 64-bit integer types (SSE2). Instructions like addps xmm, xmm/m128 (add Packed Single-precision) and pmaddwd xmm, xmm/m128 (SSE2).

But SIMD isn't the only thing that came with SSE.

SSE1+SSE2 also provide scalar instructions for float/double math in the low elements of XMM registers, making x87 mostly obsolete. Instructions like movsd, addsd (add Scalar Double-precision), ucomisd (compare scalar-double into Integer FLAGS, like fcomi). Before SSE1/2, scalar math was done in the x87 register stack, with one-operand stack instructions, frequently requiring extra fxch instructions when more than one FP variable or temporary was being worked on at once. Bad for instruction-level parallelism and not a good compiler target.

SSE also provides some NT stores like movntps to bypass cache and avoid MESI RFOs (Read For Ownership) when storing large amounts of data, so such writes don't cost double (read to fill cache and then write on eviction). See also Enhanced REP MOVSB for memcpy for more about memory bandwidth and non-RFO stores.

SSE also provides some memory-barrier instructions like sfence (SSE1) and mfence (SSE2). sfence is useful for ordering NT stores wrt. other stores. mfence would have been useful as a StoreLoad barrier if it weren't slower than a dummy atomic RMW like lock or byte [esp], 0. lfence (SSE2) also exists but isn't useful for memory ordering in x86's already strongly-ordered memory model; it is useful for blocking out-of-order exec of instructions like rdtsc. (Does the Intel Memory Model make SFENCE and LFENCE redundant?)

Many ISAs already had memory-barrier instructions as part of their basic integer ISA, so having these as part of SSE was mostly due to SSE introducing NT stores. Most ISAs also already had non-bad scalar FP math instructions, so their FP architectural state could simply be extended for SIMD, unlike x86 where the x87 stack was inconvenient and small.

CPUs with AVX also support SSE: AVX implies SSE, and the legacy encodings still work for backwards compatibility. AVX1 + AVX2 provide 256-bit versions of existing FP (AVX1) and integer (AVX2) instructions, as well as some new instructions like shuffles.

Mixing 128-bit and 256-bit vectors can be useful for mixed element sizes, or for "cleanup" of leftover elements when the count isn't a multiple of the vector width. 128-bit vectors are also the usual tool for summing the elements of a vector down to one scalar, typically after a loop that summed vertically. But normally, in a function that already depends on AVX, you'd use the AVX encoding of the 128-bit instructions.

See https://stackoverflow.com/tags/sse/info for SSE history from MMX and SSE1 through later extensions.

AVX-512 widens vectors to 512 bits and adds many more new instructions, including per-element masking.

Upvotes: 2

MuertoExcobito

Reputation: 10039

The Wikipedia page (http://en.m.wikipedia.org/wiki/SIMD) does a good job of explaining SIMD, and the instruction sets that implement it.

Single instruction, multiple data (SIMD), is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously.

SIMD is the 'concept'; SSE/AVX are implementations of the concept. A SIMD instruction set is just that: a set of instructions the CPU can execute on multiple data elements at once. As long as the CPU supports executing the instructions, multiple SIMD instruction sets with different vector widths can coexist.

Upvotes: 14
