Mokosha
Mokosha

Reputation: 2822

Fastest way to transpose 4x4 byte matrix

I have a 4x4 block of bytes that I'd like to transpose using general purpose hardware. In other words, for bytes A-P, I'm looking for the most efficient (in terms of number of instructions) way to go from

A B C D
E F G H
I J K L
M N O P

to

A E I M
B F J N
C G K O
D H L P

We can assume that I have valid pointers pointing to A, E, I, and M in memory (such that reading 32-bits from A will get me the integer containing bytes ABCD).

This is not a duplicate of this question because of the restrictions on both size and data type. Each row of my matrix can fit into a 32-bit integer, and I'm looking for answers that can perform a transpose quickly using general purpose hardware, similar to the implementation of the SSE macro _MM_TRANSPOSE4_PS.

Upvotes: 8

Views: 8577

Answers (5)

Let me rephrase your question: you're asking for a C- or C++-only solution that is portable. Then:

void transpose(uint32_t const in[4], uint32_t out[4]) {
  // A B C D    A E I M
  // E F G H    B F J N
  // I J K L    C G K O
  // M N O P    D H L P

  out[0] = in[0] & 0xFF000000U; // A . . .
  out[1] = in[1] & 0x00FF0000U; // . F . .
  out[2] = in[2] & 0x0000FF00U; // . . K .
  out[3] = in[3] & 0x000000FFU; // . . . P

  out[1] |= (in[0] <<  8) & 0xFF000000U; // B F . .
  out[2] |= (in[0] << 16) & 0xFF000000U; // C . K .
  out[3] |= (in[0] << 24);               // D . . P

  out[0] |= (in[1] >>  8) & 0x00FF0000U; // A E . .
  out[2] |= (in[1] <<  8) & 0x00FF0000U; // C G K .
  out[3] |= (in[1] << 16) & 0x00FF0000U; // D H . P

  out[0] |= (in[2] >> 16) & 0x0000FF00U; // A E I .
  out[1] |= (in[2] >>  8) & 0x0000FF00U; // B F J .
  out[3] |= (in[2] <<  8) & 0x0000FF00U; // D H L P

  out[0] |= (in[3] >> 24);               // A E I M
  out[1] |= (in[3] >>  8) & 0x000000FFU; // B F J N
  out[2] |= (in[3] <<  8) & 0x000000FFU; // C G K O
}

I don't see how it could be answered any other way, since then you'd be depending on a particular compiler compiling it in a particular way, etc.

Of course if those manipulations themselves can be somehow simplified, it'd help. So that's the only avenue of further pursuit here. Nothing stands out so far, but then it's been a long day for me.

So far, the cost is 12 shifts, 12 ORs, 16 ANDs. If the compiler and platform are any good, it can be done in 9 32 bit registers.

If the compiler is very sad, or the platform doesn't have a barrel shifter, then some casting could help extol the fact that the shifts and masks are just byte extractions:

void transpose(uint8_t const in[16], uint8_t out[16]) {
  // A B C D    A E I M
  // E F G H    B F J N
  // I J K L    C G K O
  // M N O P    D H L P

  out[0]  = in[0];  // A . . .
  out[1]  = in[4];  // A E . .
  out[2]  = in[8];  // A E I .
  out[3]  = in[12]; // A E I M
  out[4]  = in[1];  // B . . .
  out[5]  = in[5];  // B F . .
  out[6]  = in[9];  // B F J .
  out[7]  = in[13]; // B F J N
  out[8]  = in[2];  // C . . .
  out[9]  = in[6];  // C G . .
  out[10] = in[10]; // C G K .
  out[11] = in[14]; // C G K O
  out[12] = in[3];  // D . . .
  out[13] = in[7];  // D H . .
  out[14] = in[11]; // D H L .
  out[15] = in[15]; // D H L P
}

If you really want to shuffle it in-place, then the following would do.

void transpose(uint8_t m[16]) {
  std::swap(m[1], m[4]);
  std::swap(m[2], m[8]);
  std::swap(m[3], m[12]);
  std::swap(m[6], m[9]);
  std::swap(m[7], m[13]);
  std::swap(m[11], m[14]);
}

The byte-oriented versions may well produce worse code on modern platforms. Only a benchmark can tell.

Upvotes: 8

Apriori
Apriori

Reputation: 2306

I posted an answer for this same problem awhile back ago for SSE here.

The only things that need to be added are vectorized load/store operations.

This answer is similar to Z boson's answer to this question. Examples for load/store can be seen there. This answer differs because in addition to the SSE3 implementation there is an SSE2 implementation which is guaranteed to run on any x64 processor.

It's worth noting that both of these solutions assume that the entire matrix is row major in memory, but OP's question states that each row could have it's own pointer which implies the array could be fragmented.

Upvotes: 1

user1196549
user1196549

Reputation:

An efficient solution is possible on a 64 bits machine, if you accept that. First shift the 32 bits integer constants by (0,) 1, 2 and 3 bytes respectively [3 shitfs]. Then mask out the unwanted bits and perform logical ORs [12 ANDs with a constant, 12 ORs]. Finally, shift back to 32 bits [3 shifts] and read out the 32 bits.

ABCD
EFGH
IJKL
MNOP

ABCD
 EFGH
  IJKL
   MNOP

A---
 E---
  I---
   MNOP
=======
AEIMNOP
AEIM

AB--
 -F--
  -J--
   -NOP
=======
ABFJNOP
BFJN

ABC-
 --G-
  --K-
   --OP
=======
ABCGKOP
CGKO

ABCD
 ---H
  ---L
   ---P
=======
ABCDHLP
DHLP

Upvotes: 1

Z boson
Z boson

Reputation: 33659

You want potability and efficiency. Well you can't have it both ways. You said you want to do this with the fewest number of instructions. Well it's possible to do this with only one instruction with SSE3 using the pshufb instruction (see below) from the x86 instruction set.

Maybe ARM Neon has something equivalent. If you want efficiency (and are sure that you need it) then learn the hardware.

The SSE equivalent of _MM_TRANSPOSE4_PS for bytes is to use _mm_shuffle_epi8 (the intrinsic for pshufb) with a mask. Define the mask outside of your main loop.

//use -msse3 with GCC or /arch:SSE2 with MSVC
#include <stdio.h>
#include <tmmintrin.h> //SSSE3
int main() {
    char x[] = {0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,15,16};
    __m128i mask = _mm_setr_epi8(0x0,0x04,0x08,0x0c, 0x01,0x05,0x09,0x0d, 0x02,0x06,0x0a,0x0e, 0x03,0x07,0x0b,0x0f);

    __m128i v = _mm_loadu_si128((__m128i*)x);
    v = _mm_shuffle_epi8(v,mask);
    _mm_storeu_si128((__m128i*)x,v);
    for(int i=0; i<16; i++) printf("%d ", x[i]); printf("\n");
    //output: 0 4 8 12 1 5 9 13 2 6 10 15 3 7 11 16   
}

Upvotes: 13

Brandon
Brandon

Reputation: 23498

Not sure about the speed but these are okay.

template<typename T, std::size_t Size>
void Transpose(T (&Data)[Size][Size])
{
    for (int I = 0; I < Size; ++I)
    {
        for (int J = 0; J < I; ++J)
        {
            std::swap(Data[I][J], Data[J][I]);
        }
    }
}

template<typename T, std::size_t Size>
void Transpose(T (&Data)[Size * Size])
{
    for (int I = 0; I < Size; ++I)
    {
        for (int J = 0; J < I; ++J)
        {
            std::swap(Data[I * Size + J], Data[J * Size + I]);
        }
    }
}

Upvotes: 1

Related Questions