Reputation: 2193
As a learning exercise I'm trying my hand at speeding up matrix multiplication code using SIMD on various architectures. I'm having a weird issue with my 3x3 matrix multiplication code for SSE2, where its performance jumps between two extremes: either ~5 ms (expected) or ~100 ms for 1 million operations.
The only "bad" things this code does are the unaligned loads/stores and the hack at the end to store a vector into memory without its 4th element trampling adjacent memory. That would explain some performance variance, but such a large difference makes me suspect I'm missing something important.
I've tried a couple of things but I'll have another crack at it after some sleep.
See the code below. The m_matrix member is aligned on a 16-byte boundary.
void Matrix3x3::MultiplySSE2(Matrix3x3 &other, Matrix3x3 &output)
{
    __m128 a_row, r_row;
    __m128 a1_row, r1_row;
    __m128 a2_row, r2_row;
    const __m128 b_row0 = _mm_load_ps(&other.m_matrix[0]);
    const __m128 b_row1 = _mm_loadu_ps(&other.m_matrix[3]);
    const __m128 b_row2 = _mm_loadu_ps(&other.m_matrix[6]);
    // Perform dot products with first row
    a_row = _mm_set1_ps(m_matrix[0]);
    r_row = _mm_mul_ps(a_row, b_row0);
    a_row = _mm_set1_ps(m_matrix[1]);
    r_row = _mm_add_ps(_mm_mul_ps(a_row, b_row1), r_row);
    a_row = _mm_set1_ps(m_matrix[2]);
    r_row = _mm_add_ps(_mm_mul_ps(a_row, b_row2), r_row);
    _mm_store_ps(&output.m_matrix[0], r_row);
    // Perform dot products with second row
    a1_row = _mm_set1_ps(m_matrix[3]);
    r1_row = _mm_mul_ps(a1_row, b_row0);
    a1_row = _mm_set1_ps(m_matrix[4]);
    r1_row = _mm_add_ps(_mm_mul_ps(a1_row, b_row1), r1_row);
    a1_row = _mm_set1_ps(m_matrix[5]);
    r1_row = _mm_add_ps(_mm_mul_ps(a1_row, b_row2), r1_row);
    _mm_storeu_ps(&output.m_matrix[3], r1_row);
    // Perform dot products with third row
    a2_row = _mm_set1_ps(m_matrix[6]);
    r2_row = _mm_mul_ps(a2_row, b_row0);
    a2_row = _mm_set1_ps(m_matrix[7]);
    r2_row = _mm_add_ps(_mm_mul_ps(a2_row, b_row1), r2_row);
    a2_row = _mm_set1_ps(m_matrix[8]);
    r2_row = _mm_add_ps(_mm_mul_ps(a2_row, b_row2), r2_row);
    // Store only the first 3 elements of the vector so we don't trample memory
    _mm_store_ss(&output.m_matrix[6], _mm_shuffle_ps(r2_row, r2_row, _MM_SHUFFLE(0, 0, 0, 0)));
    _mm_store_ss(&output.m_matrix[7], _mm_shuffle_ps(r2_row, r2_row, _MM_SHUFFLE(1, 1, 1, 1)));
    _mm_store_ss(&output.m_matrix[8], _mm_shuffle_ps(r2_row, r2_row, _MM_SHUFFLE(2, 2, 2, 2)));
}
Upvotes: 0
Views: 232
Reputation: 363940
A performance hit like that sounds like your data is sometimes crossing a page boundary, not just a cache-line boundary. If you're testing on a buffer of many different matrices rather than the same small matrix repeatedly, maybe something else running on another CPU core is pushing your buffer out of L3?
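One way to test that theory, as a diagnostic sketch (assuming 4 KiB pages): check whether any of your 16-byte loads/stores straddle a page:

#include <cstdint>
// True if a 16-byte access at p touches two 4 KiB pages.
static bool crosses_page(const void *p)
{
    return (((uintptr_t)p & 4095) + 16) > 4096;
}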
Performance issues in your code (which don't explain the factor-of-20 variance, since these would always be slow):
_mm_set1_ps(m_matrix[3]) and so on is going to be a problem. It takes a pshufd or movaps + shufps to broadcast an element. I think this is unavoidable for matmuls, though.
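For what it's worth, _mm_set1_ps from memory typically compiles to a scalar load plus a shuffle anyway, so one alternative is a single vector load of the whole row followed by three shufps broadcasts. A sketch for the first row (in this layout only m_matrix[0] is 16-byte aligned, so the rows at [3] and [6] would need _mm_loadu_ps):

__m128 row0 = _mm_load_ps(&m_matrix[0]);                         // one load: [m0, m1, m2, m3]
__m128 a0 = _mm_shuffle_ps(row0, row0, _MM_SHUFFLE(0, 0, 0, 0)); // broadcast m_matrix[0]
__m128 a1 = _mm_shuffle_ps(row0, row0, _MM_SHUFFLE(1, 1, 1, 1)); // broadcast m_matrix[1]
__m128 a2 = _mm_shuffle_ps(row0, row0, _MM_SHUFFLE(2, 2, 2, 2)); // broadcast m_matrix[2]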
Storing the final 3 elements without writing past the end: try palignr to get the last element of the previous row into a reg with the last row. Then you can do a single unaligned store which overlaps the preceding store. That's a lot fewer shuffles, and probably faster than movss / extractps / extractps.
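Since palignr (_mm_alignr_epi8) is SSSE3 rather than SSE2, here's a sketch of the same overlapping-store idea built with two shufps instead. It assumes the row-1 store to &output.m_matrix[3] has already happened; lane 0 just rewrites output.m_matrix[5] with the value that's already there:

__m128 t = _mm_shuffle_ps(r1_row, r2_row, _MM_SHUFFLE(0, 0, 2, 2)); // [r1[2], r1[2], r2[0], r2[0]]
__m128 last = _mm_shuffle_ps(t, r2_row, _MM_SHUFFLE(2, 1, 2, 0));   // [r1[2], r2[0], r2[1], r2[2]]
_mm_storeu_ps(&output.m_matrix[5], last); // writes elements [5..8]; never touches [9]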
If you want to try something with fewer unaligned 16B stores, try movss, then shuffle or right-shift by 4 bytes (psrldq, aka _mm_bsrli_si128), then movq or movsd to store the last 8 bytes in one go. (The byte-wise shift runs on the same execution port as shuffles, unlike the per-element bit-shifts.)
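That version stays within SSE2; a sketch, assuming the result is still in r2_row:

_mm_store_ss(&output.m_matrix[6], r2_row);                // movss: element 0
__m128i hi = _mm_srli_si128(_mm_castps_si128(r2_row), 4); // psrldq: [r2[1], r2[2], r2[3], 0]
_mm_storel_epi64((__m128i *)&output.m_matrix[7], hi);     // movq: elements 1 and 2 in one 8-byte store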
Why did you do three _mm_shuffle_ps (shufps)? The low element is already the one you want for the first column of the last row, so that store needs no shuffle at all. Anyway, I think extractps is faster than shuffle + store on non-AVX, where preserving the source from being clobbered by shufps takes a move. (pshufd would also work.)
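extractps is SSE4.1, so it's outside a strict SSE2 target, but for reference, a sketch (the intrinsic returns the lane's bit pattern as an int; memcpy is the aliasing-safe way to put it into a float, and compilers can fold it into extractps straight to memory):

#include <smmintrin.h> // SSE4.1 for _mm_extract_ps
#include <cstring>
_mm_store_ss(&output.m_matrix[6], r2_row); // lane 0 needs no shuffle
int b1 = _mm_extract_ps(r2_row, 1);        // lane 1 as raw bits
int b2 = _mm_extract_ps(r2_row, 2);        // lane 2 as raw bits
std::memcpy(&output.m_matrix[7], &b1, sizeof b1);
std::memcpy(&output.m_matrix[8], &b2, sizeof b2);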
Upvotes: 2