Reputation: 107
I'm using MSVC12 (Visual Studio 2013 Express) and I am trying to implement a fast multiplication of 8*8 float values. The problem is the alignment: the vector actually has 9*n values, but I only ever need the first 8 of each group of 9. So for n=0 the 32-byte alignment is guaranteed (when I use _mm_malloc), but for n=1 the "first" value sits at an offset of 4*9 = 36 bytes, which is not a multiple of 32.
// allocate once, outside the loop, with 32-byte alignment
float *coeff_set = (float *)_mm_malloc(909 * 100 *sizeof(float), 32);
for(unsigned i = 0; i < n; i++) {
    // aligned load: this works for n=0, not n=1, n=2, ...
    __m256 coefficients = _mm256_load_ps(&coeff_set[9 * i]);
    __m256 result = _mm256_mul_ps(coefficients, coefficients);
    ...
}
Is there any way to solve this? I would like to keep the structure of my data, but I would change it if that is not possible. One solution I found was to copy the 8 floats into an aligned array first and then load them from there, but the performance loss is far too high.
Upvotes: 3
Views: 1307
Reputation: 11706
You have two choices:
1. Pad your data so that each set of values starts on a 32-byte boundary
2. Use the _mm256_loadu_ps intrinsic for unaligned accesses
The first choice is more speed-efficient, while the second is more space-efficient.
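As a rough sketch of how this could look with the layout from the question: the first function keeps the 9-float stride and swaps the aligned load for _mm256_loadu_ps, and the second assumes a padded layout so the aligned load stays valid. The function names, the AVX_TARGET macro and the padded stride of 16 floats are illustrative assumptions, not something taken from the question.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Assumption: lets GCC/Clang build the AVX code without -mavx.
   MSVC has no such attribute; build with /arch:AVX there. */
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

enum {
    SET_STRIDE    = 9,  /* floats per set, as in the question */
    PADDED_STRIDE = 16  /* hypothetical padded stride: 16 * 4 = 64 bytes */
};

/* Keep the 9-float layout and use an unaligned load.
   Squares the first 8 floats of set i and writes them to out. */
AVX_TARGET
static void square_set_unaligned(const float *coeff_set, unsigned i, float out[8])
{
    __m256 c = _mm256_loadu_ps(&coeff_set[SET_STRIDE * i]);
    _mm256_storeu_ps(out, _mm256_mul_ps(c, c));
}

/* Padded layout: every set starts at a multiple of 64 bytes, so the
   aligned load is valid as long as the base pointer is 32-byte aligned
   (for example, obtained from _mm_malloc(size, 32)). */
AVX_TARGET
static void square_set_padded(const float *padded, unsigned i, float out[8])
{
    const float *p = &padded[PADDED_STRIDE * i];
    assert(((uintptr_t)p % 32) == 0);  /* catches a misaligned base pointer */
    __m256 c = _mm256_load_ps(p);
    _mm256_storeu_ps(out, _mm256_mul_ps(c, c));
}
```

With the padded variant, 7 of every 16 floats are unused, which is the space cost of keeping the aligned load. Whether the unaligned load is measurably slower depends on the CPU, so it is worth timing both before changing the data layout.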
Upvotes: 4