Reputation: 103
I have written a function to do the transpose of a 4x4 matrix, but I do not know how to extend the code for a matrix m x n.
Where can I find maybe some sample code on matrix operations with SSE? product, transpose, inverse, etc?
This is the code of transpose 4x4:
void transpose(float* src, int n) {
__m128 row0, row1, row2, row3;
__m128 tmp1;
tmp1=_mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)), (__m64*)(src+ 4));
row1=_mm_loadh_pi(_mm_loadl_pi(row1, (__m64*)(src+8)), (__m64*)(src+12));
row0=_mm_shuffle_ps(tmp1, row1, 0x88);
row1=_mm_shuffle_ps(row1, tmp1, 0xDD);
tmp1=_mm_movelh_ps(tmp1, row1);
row1=_mm_movehl_ps(tmp1, row1);
tmp1=_mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src+ 2)), (__m64*)(src+ 6));
row3= _mm_loadh_pi(_mm_loadl_pi(row3, (__m64*)(src+10)), (__m64*)(src+14));
row2=_mm_shuffle_ps(tmp1, row3, 0x88);
row3=_mm_shuffle_ps(row3, tmp1, 0xDD);
tmp1=_mm_movelh_ps(tmp1, row3);
row3=_mm_movehl_ps(tmp1, row3);
_mm_store_ps(src, row0);
_mm_store_ps(src+4, row1);
_mm_store_ps(src+8, row2);
_mm_store_ps(src+12, row3);
}
Upvotes: 1
Views: 465
Reputation: 33709
I'm not sure how to do a in-place transpose for arbitrary matrices using SIMD efficiently but I do know how to do it for out-of-place. Let me describe how to do both
In place transpose
For in-place transpose you should see Agner Fog's Optimizing software in C++ manual. See section 9.10 "Cache contentions in large data structures" example 9.5a. For certain matrix sizes you will see a large drop in performance due to cache aliasing. See table 9.1 for examples and this Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?. Agner gives a way to fix this using loop tiling (similar to what Paul R described) in Example 9.5b.
Out of place transpose
See my answer here (the one with the most votes) What is the fastest way to transpose a matrix in C++?. I have not looked into this in ages but let me just repeat my code here:
inline void transpose4x4_SSE(float *A, float *B, const int lda, const int ldb) {
__m128 row1 = _mm_load_ps(&A[0*lda]);
__m128 row2 = _mm_load_ps(&A[1*lda]);
__m128 row3 = _mm_load_ps(&A[2*lda]);
__m128 row4 = _mm_load_ps(&A[3*lda]);
_MM_TRANSPOSE4_PS(row1, row2, row3, row4);
_mm_store_ps(&B[0*ldb], row1);
_mm_store_ps(&B[1*ldb], row2);
_mm_store_ps(&B[2*ldb], row3);
_mm_store_ps(&B[3*ldb], row4);
}
inline void transpose_block_SSE4x4(float *A, float *B, const int n, const int m, const int lda, const int ldb ,const int block_size) {
#pragma omp parallel for
for(int i=0; i<n; i+=block_size) {
for(int j=0; j<m; j+=block_size) {
int max_i2 = i+block_size < n ? i + block_size : n;
int max_j2 = j+block_size < m ? j + block_size : m;
for(int i2=i; i2<max_i2; i2+=4) {
for(int j2=j; j2<max_j2; j2+=4) {
transpose4x4_SSE(&A[i2*lda +j2], &B[j2*ldb + i2], lda, ldb);
}
}
}
}
}
Upvotes: 2
Reputation: 213210
Here is one general approach you can use for transposing an NxN matrix using tiling. You could even use your existing 4x4 transpose and work with a 4x4 tile size:
for each 4x4 block in the matrix with top left indices r, c
if block is on diagonal (i.e. if r == c)
get block a = 4x4 block at r, c
transpose block a
store block a at r, c
else if block is above diagonal (i.e. if r < c)
get block a = 4x4 block at r, c
get block b = 4x4 block at c, r
transpose block a
transpose block b
store transposed block a at c, r
store transposed block b at r, c
else // block is below diagonal
do nothing
endif
endfor
Obviously N needs to be a multiple of 4 for this to work, otherwise you will need to do some additional housekeeping.
As mentioned above in the comments, an MxN in-place transpose is hard to do - you need to either use an additional temporary matrix (which effectively makes it a not-in-place transpose) or use the method described here, but this will be much harder to vectorize with SIMD.
Upvotes: 2