Ecir Hana

Reputation: 11488

Transpose 8x8 64-bits matrix

Targeting AVX2, what is the fastest way to transpose an 8x8 matrix containing 64-bit integers (or doubles)?

I searched through this site and found several ways of doing an 8x8 transpose, but mostly for 32-bit floats. So I'm mainly asking because, first, I'm not sure whether the principles that made those algorithms fast translate readily to 64 bits, and second, AVX2 apparently has only 16 registers, so just loading all the values would take up all of them.

One way of doing it would be to call _MM_TRANSPOSE4_PD on each of the four 4x4 blocks (a 2x2 grid of blocks), but I was wondering whether this is optimal:

  #define _MM_TRANSPOSE4_PD(row0,row1,row2,row3)                        \
        {                                                               \
            __m256d tmp3, tmp2, tmp1, tmp0;                             \
                                                                        \
            /* Interleave 64-bit elements within each 128-bit lane */   \
            tmp0 = _mm256_shuffle_pd((row0),(row1), 0x0);               \
            tmp2 = _mm256_shuffle_pd((row0),(row1), 0xF);               \
            tmp1 = _mm256_shuffle_pd((row2),(row3), 0x0);               \
            tmp3 = _mm256_shuffle_pd((row2),(row3), 0xF);               \
                                                                        \
            /* Recombine 128-bit lanes to finish the 4x4 transpose */   \
            (row0) = _mm256_permute2f128_pd(tmp0, tmp1, 0x20);          \
            (row1) = _mm256_permute2f128_pd(tmp2, tmp3, 0x20);          \
            (row2) = _mm256_permute2f128_pd(tmp0, tmp1, 0x31);          \
            (row3) = _mm256_permute2f128_pd(tmp2, tmp3, 0x31);          \
        }
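
Concretely, I mean something like this (a rough sketch, out-of-place only, i.e. dst != src, with plain row-major double[8][8] arrays):

#include <immintrin.h>

// Out-of-place 8x8 transpose built from four 4x4 block transposes.
// Each transposed block lands at the mirrored block position (bj, bi).
void transpose8x8_blocks( double dst[8][8], const double src[8][8] )
{
    for ( int bi = 0; bi < 8; bi += 4 )
        for ( int bj = 0; bj < 8; bj += 4 )
        {
            __m256d r0 = _mm256_loadu_pd( &src[ bi + 0 ][ bj ] );
            __m256d r1 = _mm256_loadu_pd( &src[ bi + 1 ][ bj ] );
            __m256d r2 = _mm256_loadu_pd( &src[ bi + 2 ][ bj ] );
            __m256d r3 = _mm256_loadu_pd( &src[ bi + 3 ][ bj ] );
            _MM_TRANSPOSE4_PD( r0, r1, r2, r3 );
            _mm256_storeu_pd( &dst[ bj + 0 ][ bi ], r0 );
            _mm256_storeu_pd( &dst[ bj + 1 ][ bi ], r1 );
            _mm256_storeu_pd( &dst[ bj + 2 ][ bi ], r2 );
            _mm256_storeu_pd( &dst[ bj + 3 ][ bi ], r3 );
        }
}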

Still assuming AVX2, is transposing double[8][8] and int64_t[8][8] largely the same, in principle?

PS: And just out of curiosity, would having AVX-512 change things substantially?

Upvotes: 3

Views: 1104

Answers (2)

Peter Cordes

Reputation: 364593

For small matrices where more than 1 row can fit in a single SIMD vector, AVX-512 has very nice 2-input lane-crossing shuffles with 32-bit or 64-bit granularity and a vector control. (Unlike _mm512_unpacklo_pd, which is basically 4 separate 128-bit shuffles.)

A 4x4 double matrix is "only" 128 bytes, two ZMM __m512d vectors, so you only need two vpermt2pd (_mm512_permutex2var_pd) instructions to produce both output vectors: one shuffle per output vector, with both loads and stores being full width. You do need control vector constants, though.
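
For example, a minimal sketch (assuming AVX-512F and row-major storage; the two index vectors are exactly those control constants):

#include <immintrin.h>

// 4x4 transpose of doubles: rows 0-1 in one ZMM, rows 2-3 in the other.
// In the index vectors, values >= 8 select elements from the second source.
void transpose4x4_avx512( double* dst, const double* src )
{
    const __m512d lo = _mm512_loadu_pd( src );     // 00 01 02 03 10 11 12 13
    const __m512d hi = _mm512_loadu_pd( src + 8 ); // 20 21 22 23 30 31 32 33

    const __m512i idx0 = _mm512_setr_epi64( 0, 4, 8, 12, 1, 5, 9, 13 );
    const __m512i idx1 = _mm512_setr_epi64( 2, 6, 10, 14, 3, 7, 11, 15 );

    _mm512_storeu_pd( dst,     _mm512_permutex2var_pd( lo, idx0, hi ) ); // 00 10 20 30 01 11 21 31
    _mm512_storeu_pd( dst + 8, _mm512_permutex2var_pd( lo, idx1, hi ) ); // 02 12 22 32 03 13 23 33
}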

Using 512-bit vector instructions has some downsides (clock speed and execution port throughput), but if your program can spend a lot of time in code that uses 512-bit vectors, there's probably a significant throughput gain from throwing around more data with each instruction, and having more powerful shuffles.

With 256-bit vectors, vpermt2pd ymm would probably not be useful for a 4x4, because for each __m256d output row, each of the 4 elements you want comes from a different input row. So one 2-input shuffle can't produce the output you want.

I think lane-crossing shuffles with less than 128-bit granularity aren't useful unless your matrix is small enough to fit multiple rows in one SIMD vector. See How to transpose a 16x16 matrix using SIMD instructions? for some algorithmic complexity reasoning about 32-bit elements - an 8x8 transpose of 32-bit elements with AVX1 is about the same as an 8x8 of 64-bit elements with AVX-512, where each SIMD vector holds exactly one whole row.

So there's no need for vector constants, just immediate shuffles of 128-bit chunks plus unpacklo/hi.


Transposing an 8x8 with 512-bit vectors (8 doubles) would have the same problem: each output row of 8 doubles needs 1 double from each of 8 input vectors. So ultimately I think you want a similar strategy to Soonts' AVX answer, starting with _mm512_insertf64x4(v, load, 1) as the first step to get the first half of 2 input rows into one vector.
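
A minimal sketch of just that first step (pairing rows 0 and 4 here is an arbitrary choice for illustration; the rest of the 8x8 transpose still needs more shuffle work):

#include <immintrin.h>

// Pack the first halves (columns 0-3) of two rows of a row-major 8x8 double
// matrix into one __m512d.
__m512d load_two_half_rows( const double* src )
{
    __m512d v = _mm512_castpd256_pd512( _mm256_loadu_pd( src ) );       // row 0, cols 0-3
    return _mm512_insertf64x4( v, _mm256_loadu_pd( src + 4 * 8 ), 1 );  // row 4, cols 0-3
}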

(If you care about KNL / Xeon Phi, @ZBoson's other answer on How to transpose a 16x16 matrix using SIMD instructions? shows some interesting ideas using merge-masking with 1-input shuffles like vpermpd or vpermq, instead of 2-input shuffles like vunpcklpd or vpermt2pd.)
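
As an illustration of that idea (not the actual code from that answer), two rows can be combined with a 1-input vpermpd plus a merge-masked vpermpd instead of one 2-input vpermt2pd:

#include <immintrin.h>

// Build { a0, b0, a1, b1, a2, b2, a3, b3 } from two rows a and b using only
// 1-input lane-crossing shuffles, with merge-masking for the second row.
__m512d interleave_low_halves( __m512d a, __m512d b )
{
    const __m512i idx = _mm512_setr_epi64( 0, 0, 1, 1, 2, 2, 3, 3 );
    __m512d t = _mm512_permutexvar_pd( idx, a );           // a0 a0 a1 a1 a2 a2 a3 a3
    return _mm512_mask_permutexvar_pd( t, 0xAA, idx, b );  // a0 b0 a1 b1 a2 b2 a3 b3
}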

Using wider vectors means fewer loads and stores, and maybe even fewer total shuffles because each one combines more data. But you also have more shuffling work to do to get all 8 elements of a row into one vector, instead of just loading and storing to different places in chunks half the size of a row. It's not obvious which approach is better; I'll update this answer if I get around to actually writing the code.

Note that Ice Lake (first consumer CPU with AVX-512) can do 2 loads and 2 stores per clock. It has better shuffle throughput than Skylake-X for some shuffles, but not for any that are useful for this or Soonts' answer. (All of vperm2f128, vunpcklpd and vpermt2pd only run on port 5, for the ymm and zmm versions. https://uops.info/. vinsertf64x4 zmm, mem, 1 is 2 uops for the front-end, and needs a load port and a uop for p0/p5. (Not p1 because it's a 512-bit uop, and see also SIMD instructions lowering CPU frequency).)

Upvotes: 1

Soonts

Reputation: 21956

After some thought and discussion in the comments, I think this is the most efficient version, at least when the source and destination data are in RAM. It does not require AVX2; AVX1 is enough.

The main idea: modern CPUs can do twice as many load micro-ops as stores, and on many CPUs loading into the higher half of a vector with vinsertf128 has the same cost as a regular 16-byte load. Compared to your macro, this version no longer needs those relatively expensive (3 cycles of latency on most CPUs) vperm2f128 shuffles.

struct Matrix4x4
{
    __m256d r0, r1, r2, r3;
};

inline void loadTransposed( Matrix4x4& mat, const double* rsi, size_t stride = 8 )
{
    // Load top half of the matrix into low half of 4 registers
    __m256d t0 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi ) );     // 00, 01
    __m256d t1 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi + 2 ) ); // 02, 03
    rsi += stride;
    __m256d t2 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi ) );     // 10, 11
    __m256d t3 = _mm256_castpd128_pd256( _mm_loadu_pd( rsi + 2 ) ); // 12, 13
    rsi += stride;
    // Load bottom half of the matrix into high half of these registers
    t0 = _mm256_insertf128_pd( t0, _mm_loadu_pd( rsi ), 1 );    // 00, 01, 20, 21
    t1 = _mm256_insertf128_pd( t1, _mm_loadu_pd( rsi + 2 ), 1 );// 02, 03, 22, 23
    rsi += stride;
    t2 = _mm256_insertf128_pd( t2, _mm_loadu_pd( rsi ), 1 );    // 10, 11, 30, 31
    t3 = _mm256_insertf128_pd( t3, _mm_loadu_pd( rsi + 2 ), 1 );// 12, 13, 32, 33

    // Transpose 2x2 blocks in registers.
    // Due to the tricky way we loaded stuff, that's enough to transpose the complete 4x4 matrix.
    mat.r0 = _mm256_unpacklo_pd( t0, t2 ); // 00, 10, 20, 30
    mat.r1 = _mm256_unpackhi_pd( t0, t2 ); // 01, 11, 21, 31
    mat.r2 = _mm256_unpacklo_pd( t1, t3 ); // 02, 12, 22, 32
    mat.r3 = _mm256_unpackhi_pd( t1, t3 ); // 03, 13, 23, 33
}

inline void store( const Matrix4x4& mat, double* rdi, size_t stride = 8 )
{
    _mm256_storeu_pd( rdi, mat.r0 );
    _mm256_storeu_pd( rdi + stride, mat.r1 );
    _mm256_storeu_pd( rdi + stride * 2, mat.r2 );
    _mm256_storeu_pd( rdi + stride * 3, mat.r3 );
}

// Transpose 8x8 matrix of double values
void transpose8x8( double* rdi, const double* rsi )
{
    Matrix4x4 block;
    // Top-left corner
    loadTransposed( block, rsi );
    store( block, rdi );

#if 1
    // Using another instance of the block to support in-place transpose
    Matrix4x4 block2;
    loadTransposed( block, rsi + 4 );       // top right block
    loadTransposed( block2, rsi + 8 * 4 ); // bottom left block

    store( block2, rdi + 4 );
    store( block, rdi + 8 * 4 );
#else
    // Flip the #if if you can guarantee ( rsi != rdi )
    // Performance is about the same, but this version uses 4 fewer vector registers,
    // slightly more efficient when some registers need to be backed up / restored.
    assert( rsi != rdi );
    loadTransposed( block, rsi + 4 );
    store( block, rdi + 8 * 4 );

    loadTransposed( block, rsi + 8 * 4 );
    store( block, rdi + 4 );
#endif
    // Bottom-right corner
    loadTransposed( block, rsi + 8 * 4 + 4 );
    store( block, rdi + 8 * 4 + 4 );
}
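
Hypothetical usage, assuming a row-major double[8][8] matrix (stride of 8 doubles between rows); this version also works in place:

double m[8][8] = { /* ... */ };
transpose8x8( &m[0][0], &m[0][0] ); // rdi == rsi is fine here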

For completeness, here's a version which uses code very similar to your macro: it does half as many loads, the same number of stores, and more shuffles. I haven't benchmarked it, but I would expect it to be slightly slower.

struct Matrix4x4
{
    __m256d r0, r1, r2, r3;
};

inline void load( Matrix4x4& mat, const double* rsi, size_t stride = 8 )
{
    mat.r0 = _mm256_loadu_pd( rsi );
    mat.r1 = _mm256_loadu_pd( rsi + stride );
    mat.r2 = _mm256_loadu_pd( rsi + stride * 2 );
    mat.r3 = _mm256_loadu_pd( rsi + stride * 3 );
}

inline void store( const Matrix4x4& mat, double* rdi, size_t stride = 8 )
{
    _mm256_storeu_pd( rdi, mat.r0 );
    _mm256_storeu_pd( rdi + stride, mat.r1 );
    _mm256_storeu_pd( rdi + stride * 2, mat.r2 );
    _mm256_storeu_pd( rdi + stride * 3, mat.r3 );
}

inline void transpose( Matrix4x4& m4 )
{
    // These unpack instructions transpose lanes within 2x2 blocks of the matrix
    const __m256d t0 = _mm256_unpacklo_pd( m4.r0, m4.r1 );
    const __m256d t1 = _mm256_unpacklo_pd( m4.r2, m4.r3 );
    const __m256d t2 = _mm256_unpackhi_pd( m4.r0, m4.r1 );
    const __m256d t3 = _mm256_unpackhi_pd( m4.r2, m4.r3 );
    // Produce the transposed matrix by combining these blocks
    m4.r0 = _mm256_permute2f128_pd( t0, t1, 0x20 );
    m4.r1 = _mm256_permute2f128_pd( t2, t3, 0x20 );
    m4.r2 = _mm256_permute2f128_pd( t0, t1, 0x31 );
    m4.r3 = _mm256_permute2f128_pd( t2, t3, 0x31 );
}

// Transpose 8x8 matrix with double values
void transpose8x8( double* rdi, const double* rsi )
{
    Matrix4x4 block;
    // Top-left corner
    load( block, rsi );
    transpose( block );
    store( block, rdi );

    // Using another instance of the block to support in-place transpose, with very small overhead
    Matrix4x4 block2;
    load( block, rsi + 4 );     // top right block
    load( block2, rsi + 8 * 4 ); // bottom left block

    transpose( block2 );
    store( block2, rdi + 4 );
    transpose( block );
    store( block, rdi + 8 * 4 );

    // Bottom-right corner
    load( block, rsi + 8 * 4 + 4 );
    transpose( block );
    store( block, rdi + 8 * 4 + 4 );
}

Upvotes: 3
