Lu4

Reputation: 15032

Binary matrix multiplication bit twiddling hack

Abstract

Hi, suppose you have two independent 8x8 binary matrices A and T, each packed into a 64-bit integer. T is stored in transposed form; keeping it transposed lets the multiplication work on T's rows rather than its columns, which is super convenient for binary arithmetic. You want to multiply these matrices, except that the result is truncated to 64 bits: whenever a cell of the conventional product would be nonzero, the corresponding cell of the result contains 1, otherwise 0. (A naive bit-by-bit reference of this operation is sketched after the example below.)

Example

   A        T
00000001 01111101 
01010100 01100101 
10010111 00010100 
10110000 00011000 <-- This matrix is transposed
11000100 00111110 
10000011 10101111 
11110101 11000100 
10100000 01100010 

Binary and traditional multiplication results:

 Binary  Traditional
11000100  11000100
11111111  32212121
11111111  32213421
11111111  21112211
11101111  22101231
11001111  11001311
11111111  54213432
11001111  11001211
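
For reference, the saturated product can be written out naively like this (an illustrative C sketch, not part of the original question; it assumes cell (row, col) lives at bit 8*row + col and that T already holds the transposed right-hand matrix):

#include <stdint.h>

uint64_t bool_mul_ref (uint64_t A, uint64_t T) {
    uint64_t R = 0;
    for (int row = 0; row < 8; ++row) {
        for (int col = 0; col < 8; ++col) {
            uint64_t a_row = (A >> (8 * row)) & 0xFF; // row of A
            uint64_t t_row = (T >> (8 * col)) & 0xFF; // row of T = column of the original right matrix
            if (a_row & t_row)                        // any matching pair of 1s makes the cell 1
                R |= UINT64_C(1) << (8 * row + col);
        }
    }
    return R;
}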

Question

How do you multiply these matrices in the way described above in the most efficient manner?

P.S

I was trying to take advantage of bitwise AND (i.e. the & operator) instead of performing the multiplication on separate bits; in that case I had to prepare the data for multiplication:

ulong u;

// broadcast the lowest byte of T (one row of T, i.e. one column
// of the original matrix) into all eight bytes of u
u = T & 0xFF;
u = (u << 0)  + (u << 8)  + (u << 16) + (u << 24)
  + (u << 32) + (u << 40) + (u << 48) + (u << 56);

Now, performing a bitwise AND of A with u yields the following:

   A        u        R        C
00000001 01111101 00000001    1
01010100 01111101 01010100    3
10010111 01111101 00010101    3
10110000 01111101 00110000    2
11000100 01111101 01000100    2
10000011 01111101 00000001    1
11110101 01111101 01110101    5
10100000 01111101 00100000    1

In the example above, R contains the bitwise product of A's bits and u's bits, and to obtain the final value we must sum all the bits in each row. Notice that column C contains exactly the values found in the first column of the Traditional result above. The problem is that during this step I have to operate on separate bits, which I think is a sub-optimal approach. I've read through http://graphics.stanford.edu/~seander/bithacks.html looking for a way to do this in parallel, but no luck. If anyone has an idea how to "flatten" and "merge" the values located in the R column into the resulting 64-bit matrix, I would appreciate it if you dropped me a few lines.

Thank you,

Edit

With a big thank you to David Eisenstat, the final algorithm looks like this:

var A = ...;
var T = ...; // T == transpose(t), where t is the original matrix; the algorithm works with the transposed matrix

ulong U, r = 0;
var D = 0x8040201008040201UL;

U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D); T = (T << 8) | (T >> 56); D = (D << 8) | (D >> 56);
U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & D);

The following piece of code:

    public static void Main (string[] args){
        ulong U;
        var Random = new Xor128 ();

        var timer = DateTime.Now;

        var A = Random.As<IUniformRandom<UInt64>>().Evaluate();
        var T = Random.As<IUniformRandom<UInt64>>().Evaluate();

        var steps = 10000000;

        for (var i = 0; i < steps; i++) {
            ulong r = 0;

            var d = 0x8040201008040201UL;

            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d); T = (T << 8) | (T >> 56); d = (d << 8) | (d >> 56);
            U = A & T; U |= U >> 1; U |= U >> 2; U |= U >> 4; U &= 0x0101010101010101UL; U = (U << 8) - U; r |= (U & d);
        }

        Console.WriteLine (DateTime.Now - timer);


        var m1 = new Int32[8,8];
        var m2 = new Int32[8,8];
        var m3 = new Int32[8,8];

        for (int row = 0; row < 8; row++) {
            for (int col = 0; col < 8; col++) {
                m1 [row, col] = Random.As<IUniformRandom<Int32>> ().Evaluate(0, 1);
                m2 [row, col] = Random.As<IUniformRandom<Int32>> ().Evaluate(0, 1);
                m3 [row, col] = Random.As<IUniformRandom<Int32>> ().Evaluate(0, 1);
            }
        }

        timer = DateTime.Now;

        for (int i = 0; i < steps; i++) {
            for (int row = 0; row < 8; row++) {
                for (int col = 0; col < 8; col++) {
                    var sum = 0;

                    for (int temp = 0; temp < 8; temp++) {
                        sum += m1 [row, temp] * m2 [temp, col];
                    }

                    m3 [row, col] = sum;
                }
            }
        }

        Console.WriteLine (DateTime.Now - timer);
    }

Shows me the following results:

00:00:02.4035870
00:00:57.5147150

And that's a 23x performance improvement under Mac OS X / Mono. Thanks, everyone!

Upvotes: 11

Views: 4723

Answers (5)

paperclip optimizer

Reputation: 485

For strictly Boolean algebra, this can be achieved pretty efficiently on x86-64 using the approach I described here:

https://stackoverflow.com/a/55307540/11147804

The only difference is that the data from the transposed matrix also needs to be extracted by columns and repacked into rows before each 64-bit product. Fortunately this is trivial to do with the BMI2 parallel-bit-extract instruction, accessible in GCC through the _pext_u64 intrinsic:

#include <stdint.h>
#include <immintrin.h> // BMI2 _pext_u64

uint64_t torow (uint64_t c); // helper, defined below

uint64_t mul8x8T (uint64_t A, uint64_t B) {

    const uint64_t COL = 0x0101010101010101;

    uint64_t C = 0;

    for (int i = 0; i < 8; ++i) {
        uint64_t p = COL & (A >> i);          // select column i of A
        uint64_t r = torow( COL & (B >> i) ); // select column i of B and repack it as a row
        C |= (p * r); // use ^ for GF(2) instead
    }
    return C;
}

uint64_t torow (uint64_t c) {
    const uint64_t ROW = 0x00000000000000FF; // mask of the first row
    const uint64_t COL = 0x0101010101010101; // mask of the first column

    // select bits of c in positions marked by COL,
    // and pack them consecutively;
    // the last 'and' is included for clarity and is not
    // really necessary
    return _pext_u64(c, COL) & ROW;
}

On processors which do not support this particular instruction, one possible solution is to adapt the typical bit trick for packing, used for example in the classic bit-order reversal of a byte with a 64-bit multiplication:

https://graphics.stanford.edu/~seander/bithacks.html#ReverseByteWith64BitsDiv

Using masks and an integer multiplication by a suitable constant produces a quadword containing the packed result as a bit substring, which can then be extracted with a bit shift and a mask.

The idea is to think of the multiplication step as a parallel bit shift where every bit in the input is shifted by a different amount, specified in the constant. This is always possible as long as the strides of both numbers do not collide on some position in the result, i.e. as long as each partial sum from the multiplication updates different bit positions in the result. This avoids any potential carries, which makes the bit-by-bit sum equivalent to bit-parallel OR (or XOR).

uint64_t torow (uint64_t c) {
    const uint64_t ROW = 0x00000000000000FF; // select 8 lowest consecutive bits to get the first row
    const uint64_t COL = 0x0101010101010101; // select every 8th bit to get the first column
    const uint64_t DIA = 0x8040201008040201; // select every 8+1 bit to obtain a diagonal

    c *= ROW; // "copies" first column to the rest
    c &= DIA; // only use diagonal bits or else there will be position collisions and unexpected carries
    c *= COL; // "scatters" every bit to all rows after itself; the last row will now contain the packed bits
    return c >> 56; // move last row to first & discard the rest
}

There are other possible implementations of this function using more operations of lower strength, the fastest of which will depend on the target architecture.
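
For instance, the packing can also be done with a short shift-and-mask cascade (an illustrative sketch, not from the original answer; torow_shifts is a hypothetical name, and it assumes the input only has bits in the positions marked by COL, as is the case in mul8x8T above):

#include <stdint.h>

// Pack the bits at positions 0, 8, 16, ..., 56 into the low byte
// using only shifts, ORs and masks (no multiplication, no BMI2).
uint64_t torow_shifts (uint64_t c) {
    c = (c | (c >> 7))  & UINT64_C(0x0003000300030003); // gather pairs into each 16-bit lane
    c = (c | (c >> 14)) & UINT64_C(0x0000000F0000000F); // gather nibbles into each 32-bit half
    c = (c | (c >> 28)) & UINT64_C(0x00000000000000FF); // gather the final byte
    return c;
}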

A different approach is to transpose the matrix, which can be done very efficiently in just 3 steps by exploiting bit-level parallelism:

uint64_t transpose8x8 (uint64_t A) {

    const uint64_t M2x2 = 0xF0F0F0F00F0F0F0F; // diagonal blocks of the partition into 4x4 blocks
    const uint64_t M4x4 = 0xCCCC3333CCCC3333; // diagonal blocks of the partition into 2x2 blocks
    const uint64_t M8x8 = 0xAA55AA55AA55AA55; // diagonal cells of the partition into 1x1 bools

    const uint64_t U2x2 = 0x00000000F0F0F0F0; // one off-diagonal 4x4 block
    const uint64_t U4x4 = 0x0000CCCC0000CCCC; // one off-diagonal 2x2 block in each 4x4 block
    const uint64_t U8x8 = 0x00AA00AA00AA00AA; // one off-diagonal cell in each 2x2 block

    uint64_t T1, T2, T3;
    T1 = (A  & M2x2) | ((A  & U2x2) << (32-4)) | ((A  >> (32-4)) & U2x2);
    T2 = (T1 & M4x4) | ((T1 & U4x4) << (16-2)) | ((T1 >> (16-2)) & U4x4);
    T3 = (T2 & M8x8) | ((T2 & U8x8) << ( 8-1)) | ((T2 >> ( 8-1)) & U8x8);

    return T3;
}

It may look convoluted, but the idea behind it is quite simple: the 8x8 matrix is partitioned into 2x2 blocks of 4x4 elements each, and the upper-right block is swapped with the lower-left one. Once we work out the correct bit masks and the necessary shifts for our 64-bit representation, we can apply the algorithm recursively down to blocks of 1x1 element to obtain the transposed matrix.

Then performing the matrix product just requires transposing the right matrix and using the original mul8x8 algorithm linked above.

Upvotes: 4

hivert

Reputation: 10667

If you allow lower-level constructs than plain C/C++, then SSE/AVX machine instructions together with compiler intrinsics let you write much faster code (4x according to a benchmark I made). You need to use a non-standard vector type (supported at least by GCC, ICC, Clang):

using epu = uint8_t __attribute__((vector_size(16)));

I'm using a class such as

class BMat8 {
    [...]
  private:
    uint64_t _data;
};

Then the following code should do what you want:

static constexpr epu rothigh { 0, 1, 2, 3, 4, 5, 6, 7,15, 8, 9,10,11,12,13,14};
static constexpr epu rot2    { 6, 7, 0, 1, 2, 3, 4, 5,14,15, 8, 9,10,11,12,13};

inline BMat8 operator*(BMat8 const& tr) const {
  epu x = _mm_set_epi64x(_data, _data);
  epu y = _mm_shuffle_epi8(_mm_set_epi64x(tr._data, tr._data), rothigh);
  epu data {};
  epu diag =  {0x01,0x02,0x04,0x08,0x10,0x20,0x40,0x80,
               0x80,0x01,0x02,0x04,0x08,0x10,0x20,0x40};
  for (int i = 0; i < 4; ++i) {
    data |= ((x & y) != epu {}) & diag;
    y    = _mm_shuffle_epi8(y, rot2);
    diag = _mm_shuffle_epi8(diag, rot2);
  }
  return BMat8(_mm_extract_epi64(data, 0) | _mm_extract_epi64(data, 1));
}

In particular, by using 128-bit registers I'm able to do two iterations at once.

Upvotes: 2

bavaza

Reputation: 11037

It is not clear what data structure you are using, which language (yes, I know you said 'any language'), what you are trying to optimize (speed? memory?), etc. All of these may have a profound impact on your solution.

Some examples:

  • Say this was C/C++, and your matrices are contiguous bits in memory, so each row/column maps to a UINT8. In this case, multiplying a row with a column reduces to an 8-bit bitwise AND plus a check whether the result is greater than 0 (no need to sum the bits); see the sketch after this list. This takes 2 processor instructions.
  • If you are forced to do bit-by-bit operations, use the bitwise 'or' (|) instead of +. Some languages may lazy-evaluate this, stopping at the first '1' they encounter.
  • If you can multi-thread, you could speed up the calculations.
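
A minimal illustration of the first bullet (an illustrative sketch, not from the original answer; bool_cell is a hypothetical name):

#include <stdint.h>

// One cell of the saturated Boolean product: with a row of the left matrix
// and a column of the right matrix each packed into a uint8_t, the cell is
// just an AND followed by a zero test.
static inline int bool_cell (uint8_t row_bits, uint8_t col_bits) {
    return (row_bits & col_bits) != 0; // 1 if any position matches, otherwise 0
}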

BTW, I'm assuming you have a lot of matrices to process; otherwise I would use direct, readable code. My guess is that even with a lot of matrices, the gain in performance would be negligible.

Upvotes: 2

David Eisenstat

Reputation: 65427

I'm not sure about most efficient, but here's something to try. The following sequence of instructions computes the main diagonal of the product A * T'. Rotate both T and D by 8 bits and repeat for 7 more iterations.

// uint64_t A, T;
uint64_t D = UINT64_C(0x8040201008040201);
uint64_t P = A & T;
// test whether each byte is nonzero
P |= P >> 1;
P |= P >> 2;
P |= P >> 4;
P &= UINT64_C(0x0101010101010101);
// fill each nonzero byte with ones
P *= 255;  // or P = (P << 8) - P;
// leave only the current diagonal
P &= D;
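
Putting it together (an illustrative sketch based on the question's Edit rather than part of the original answer; bmul8x8T is a hypothetical name): rotate T and the diagonal mask after each step and OR the per-diagonal results together.

#include <stdint.h>

uint64_t bmul8x8T (uint64_t A, uint64_t T) {
    uint64_t D = UINT64_C(0x8040201008040201);
    uint64_t r = 0;
    for (int i = 0; i < 8; ++i) {
        uint64_t P = A & T;
        // test whether each byte is nonzero
        P |= P >> 1;
        P |= P >> 2;
        P |= P >> 4;
        P &= UINT64_C(0x0101010101010101);
        P *= 255;                  // fill each nonzero byte with ones
        r |= P & D;                // keep only the current diagonal
        T = (T << 8) | (T >> 56);  // rotate T by 8 bits
        D = (D << 8) | (D >> 56);  // rotate the diagonal mask by 8 bits
    }
    return r;
}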

Upvotes: 7

fumigail

Reputation: 853

If you are looking for a way to do dense matrix multiplication in parallel, partition your result matrix into blocks and compute each block in parallel.

http://en.wikipedia.org/wiki/Block_matrix#Block_matrix_multiplication
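
For the Boolean product in this question, block multiplication fits naturally on top of an 8x8 kernel (an illustrative sketch with hypothetical names; it assumes an 8x8 kernel such as mul8x8T and a transpose8x8 helper like those in the other answers, matrices stored as row-major grids of 64-bit tiles, and OpenMP for the parallel part):

#include <stdint.h>

uint64_t mul8x8T (uint64_t A, uint64_t B);   // assumed: 8x8 Boolean kernel, computes A * transpose(B)
uint64_t transpose8x8 (uint64_t A);          // assumed: 8x8 bit transpose

// C = A * B over the Boolean semiring, where each matrix is an nblk x nblk
// grid of 8x8 bit tiles; the "sum" of block products is a bitwise OR.
void bool_mul_blocks (const uint64_t *A, const uint64_t *B, uint64_t *C, int nblk) {
    #pragma omp parallel for collapse(2)     // each output block is independent
    for (int i = 0; i < nblk; ++i)
        for (int j = 0; j < nblk; ++j) {
            uint64_t acc = 0;
            for (int k = 0; k < nblk; ++k)
                acc |= mul8x8T(A[i * nblk + k], transpose8x8(B[k * nblk + j]));
            C[i * nblk + j] = acc;
        }
}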

Upvotes: 3
