Dave

Reputation: 1334

SSE alignment of 3D vector

I wish to ensure SSE is used for arithmetic on my 3D (96 bit) float vectors. However, I have read conflicting views on just what is necessary.

Some articles/posts say I need to use a 4D vector and "ignore" the 4th element, some say I must decorate my class with things like __declspec(align(16)) and override the new operator, and some say the compiler is clever enough to align things for me (I really hope this is true!).

I am using the Eigen library, but find that the "unsupported" AlignedVector3 class isn't fit for purpose (e.g. division by zero errors when doing component-wise division, lpNorm function includes the dummy 4th element).

A lot of the articles I've read are several years old now, so I hold out hope that modern compilers/SSE versions/CPUs can just align the data for me, or work with non-16 byte aligned data. Any up to date knowledge on this will be much appreciated!

Upvotes: 1

Views: 1523

Answers (1)

Seb Maire

Reputation: 130

We use SIMD at work, so maybe I can give you some feedback on it. Alignment is something you have to take care of when dealing with SIMD: the aligned SSE load and store instructions (movaps, or _mm_load_ps / _mm_store_ps with intrinsics) require 16-byte-aligned addresses and fault if they don't get them. The unaligned variants (movups, or _mm_loadu_ps / _mm_storeu_ps) work on any address; on older CPUs they were noticeably slower, while on recent ones the penalty is small unless the access crosses a cache line (much like unaligned scalar types, which used to crash and are now handled by the CPU at some cost in performance). Maybe you can look at "SSE, intrinsics, and alignment"; it seems to have good answers for the alignment part of the question.
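To make the alignment point concrete, here is a minimal sketch using the SSE intrinsics from <xmmintrin.h>. The function names sum4_aligned and sum4_unaligned are made up for illustration; only the _mm_* calls are real intrinsics.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (_mm_load_ps, _mm_loadu_ps, ...)

// Hypothetical helper: sums four consecutive floats starting at p.
// p MUST be 16-byte aligned, because _mm_load_ps is the aligned load
// and faults on a misaligned address.
inline float sum4_aligned(const float* p) {
    __m128 v = _mm_load_ps(p);
    alignas(16) float out[4];
    _mm_store_ps(out, v);  // out is 16-byte aligned, so the aligned store is safe
    return out[0] + out[1] + out[2] + out[3];
}

// Hypothetical helper: same sum, but p may have any alignment,
// because _mm_loadu_ps is the unaligned load.
inline float sum4_unaligned(const float* p) {
    __m128 v = _mm_loadu_ps(p);
    alignas(16) float out[4];
    _mm_store_ps(out, v);
    return out[0] + out[1] + out[2] + out[3];
}
```

Given `alignas(16) float buf[8]`, calling `sum4_aligned(buf)` is fine, while `buf + 1` is only 4-byte aligned and has to go through `sum4_unaligned(buf + 1)`.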

As for using what is physically a 4D vector as a 3D vector: it's not really good practice, because one lane in four does no useful work, so you don't get the full performance of the SIMD instructions. A better fit is to use a Structure of Arrays (SoA).

Note: I am assuming 128-bit SIMD registers, each holding 4 scalar values (int or float).

For example, if you have four 3D points (or vectors), following your way you will have four 4D vectors and ignore the 4th component of each. In total you end up with 4 * 4 = 16 slots, of which only 12 hold useful values.

By using SoA, you will have three 128-bit SIMD registers (12 values, all of them used), and you will store your points in the following way:

  • r1: x x x x
  • r2: y y y y
  • r3: z z z z

This way you fill the SIMD registers entirely and get the maximum benefit from SIMD. The other thing is that many of the calculations you will have to make (for example, adding 2 groups of 4 vectors) will take only 3 SIMD arithmetic instructions, one per component. It's a bit tricky to use and understand, but once you do, the gain is great.
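The layout above can be sketched as follows. This is only an illustration: the type Vec3x4 and the function add4 are hypothetical names, not part of Eigen or any other library.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical SoA packet: four 3D vectors stored component by component.
// alignas(16) plus three 16-byte arrays keeps x, y and z each 16-byte aligned.
struct alignas(16) Vec3x4 {
    float x[4];  // x of vectors 0..3
    float y[4];  // y of vectors 0..3
    float z[4];  // z of vectors 0..3
};

// Adds four pairs of 3D vectors using three SSE additions, one per component.
inline Vec3x4 add4(const Vec3x4& a, const Vec3x4& b) {
    Vec3x4 r;
    _mm_store_ps(r.x, _mm_add_ps(_mm_load_ps(a.x), _mm_load_ps(b.x)));
    _mm_store_ps(r.y, _mm_add_ps(_mm_load_ps(a.y), _mm_load_ps(b.y)));
    _mm_store_ps(r.z, _mm_add_ps(_mm_load_ps(a.z), _mm_load_ps(b.z)));
    return r;
}
```

Every lane of every register carries a real component here, so there is no dummy 4th element to leak into norms or component-wise divisions.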

Of course, you won't be able to use it this way in all cases, so sometimes you will fall back to the original solution of ignoring the last value.
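For that fallback, a minimal sketch might look like the following. Vec3Padded and add_padded are hypothetical names; the sketch assumes a C++11 compiler, where alignas(16) plays the role of __declspec(align(16)).

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical padded vector: v[0..2] are x, y, z; v[3] is an ignored slot
// that only exists to fill the 128-bit register.
struct alignas(16) Vec3Padded {
    float v[4];
};

// Adds two padded vectors with a single SSE addition.
// The padding lane is added too, but its result is never read.
inline Vec3Padded add_padded(const Vec3Padded& a, const Vec3Padded& b) {
    Vec3Padded r;
    _mm_store_ps(r.v, _mm_add_ps(_mm_load_ps(a.v), _mm_load_ps(b.v)));
    return r;
}
```

The alignas specifier takes care of stack and static objects. For heap objects, C++17's aligned new honours the requested alignment; with older standards you may still need an aligned allocator or an overridden operator new. Whatever ends up in the padding lane takes part in every operation, so it is worth keeping it at a harmless value (for example 1.0f before a component-wise division).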

Upvotes: 2
