Daniel
Daniel

Reputation: 1825

Transposing a 4x4 matrix represented as a ulong value(as fast as possible)

I have been working on a C# implementation of 2048 for the purpose of implementing reinforcement learning.

The "slide" operation for each move requires that tiles be moved and combined according to specific rules. Doing so involves a number of transformations on a 2d array of values.

Until recently I was using a 4x4 byte matrix:

var field = new byte[4,4];

Each value was an exponent of 2, so 0=0, 1=2, 2=4, 3=8, and so forth. The 2048 tile would be represented by 11.

Because the (practical) maximum value for a given tile is 15 (which only requires 4 bits), it is possible to shove the contents of this 4x4 byte array into a ulong value.

It turns out that certain operations are vastly more efficient with this representation. For example, I commonly have to invert arrays like this:

    //flip horizontally
    const byte SZ = 4;
    public static byte[,] Invert(this byte[,] squares)
    {
        var tmp = new byte[SZ, SZ];
        for (byte x = 0; x < SZ; x++)
            for (byte y = 0; y < SZ; y++)
                tmp[x, y] = squares[x, SZ - y - 1];
        return tmp;
    }

I can do this inversion to a ulong ~15x faster:

    public static ulong Invert(this ulong state)
    {
        ulong c1 = state & 0xF000F000F000F000L;
        ulong c2 = state & 0x0F000F000F000F00L;
        ulong c3 = state & 0x00F000F000F000F0L;
        ulong c4 = state & 0x000F000F000F000FL;

        return (c1 >> 12) | (c2 >> 4) | (c3 << 4) | (c4 << 12);
    }

Note the use of hex, which is extremely useful because each character represents a tile.

The operation I've having the most trouble with is Transpose, which flipped the x and y coordinates of values in the 2d array, like this:

    public static byte[,] Transpose(this byte[,] squares)
    {
        var tmp = new byte[SZ, SZ];
        for (byte x = 0; x < SZ; x++)
            for (byte y = 0; y < SZ; y++)
                tmp[y, x] = squares[x, y];
        return tmp;
    }

The fastest way I've found to do this is using this bit of ridiculousness:

    public static ulong Transpose(this ulong state)
    {
        ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals

        result |= (state & 0x0F00000000000000L) >> 12;
        result |= (state & 0x00F0000000000000L) >> 24;
        result |= (state & 0x000F000000000000L) >> 36;
        result |= (state & 0x0000F00000000000L) << 12;
        result |= (state & 0x000000F000000000L) >> 12;
        result |= (state & 0x0000000F00000000L) >> 24;
        result |= (state & 0x00000000F0000000L) << 24;
        result |= (state & 0x000000000F000000L) << 12;
        result |= (state & 0x00000000000F0000L) >> 12;
        result |= (state & 0x000000000000F000L) << 36;
        result |= (state & 0x0000000000000F00L) << 24;
        result |= (state & 0x00000000000000F0L) << 12;

        return result;
    }

Shockingly, this is still nearly 3x faster than the loop version. However, I'm looking for a more performant method either using leveraging a pattern inherent in transposition or more efficient management of the bits I'm moving around.

Upvotes: 3

Views: 317

Answers (2)

user555045
user555045

Reputation: 64904

An other trick is that sometimes it is possible to move disjoint sets of bit-groups left by different amounts using a multiplication. This requires that the partial products don't "overlap".

For example the moves left by 12 and 24 could be done as:

ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
r0 |= t & 0x0FF000FF000F0000UL;

That reduces 6 operations to 4. The multiplication shouldn't be slow, on a modern processor it takes 3 cycles, and while it is working on that multiply the processor can go ahead and work on the other steps too. As a bonus, on Intel the imul would go to port 1 while the shifts go to ports 0 and 6, so saving two shifts with a multiply is a good deal, opening up more room for the other shifts. The AND and OR operations can go to any ALU port and aren't really the problem here, but it may help for latency to split up the chain of dependent ORs:

public static ulong Transpose(this ulong state)
{
    ulong r0 = state & 0xF0000F0000F0000FL; //unchanged diagonals

    ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
    ulong r1 = (state & 0x0F0000F0000F0000L) >> 12;
    r0 |= (state & 0x00F0000F00000000L) >> 24;
    r1 |= (state & 0x000F000000000000L) >> 36;
    r0 |= (state & 0x000000000000F000L) << 36;
    r1 |= t & 0x0FF000FF000F0000UL;

    return r0 | r1;
}

Upvotes: 1

Aldert
Aldert

Reputation: 4313

you can skip 6 steps by combining, i commented them out to show you result, should make it twice as fast:

public static ulong Transpose(this ulong state)
        {
            ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals

            result |= (state & 0x0F0000F0000F0000L) >> 12;
            result |= (state & 0x00F0000F00000000L) >> 24;
            result |= (state & 0x000F000000000000L) >> 36;
            result |= (state & 0x0000F0000F0000F0L) << 12;
            //result |= (state & 0x000000F000000000L) >> 12;
            //result |= (state & 0x0000000F00000000L) >> 24;
            result |= (state & 0x00000000F0000F00L) << 24;
            //result |= (state & 0x000000000F000000L) << 12;
            //result |= (state & 0x00000000000F0000L) >> 12;
            result |= (state & 0x000000000000F000L) << 36;
            //result |= (state & 0x0000000000000F00L) << 24;
            //result |= (state & 0x00000000000000F0L) << 12;

            return result;
        }

Upvotes: 2

Related Questions