Constant time for multiplication in Galois Field GF(4)

Question

Tl;dr: Is there a way to improve the code below in any way (including multithreading) as the code will run hundreds of billions of times?

To objective is to find a constant time algorithm (without a for loop) for performing multiplication in Galois Field GF(4). I am not sure if this is even possible but it is worth a try.

Some background: multiplication in GF(2) or base 2 is the equivalent of anding the two values being multiplied. This is because:

a	b	a × b = a ∧ b
0	0	0
0	1	0
1	0	0
1	1	1

For example:

10101011010100 × 10011000101101 = 

10101011010100 
10011000101101 ∧
--------------
10001000000100

When it comes to GF(4), there are four different symbols that can be used: 0, 1, 2 and 3. It is not the same as performing multiplication in base 4 because some digits don't give an expected result when multiplied by other digits. They are bolded in the table below:

a	b	a × b
0	0	0
0	1	0
0	2	0
0	3	0
1	0	0
1	1	1
1	2	2
1	3	3
2	0	0
2	1	2
2	2	3
2	3	1
3	0	0
3	1	3
3	2	1
3	3	2

A more compact form of the above table can be summarized using following multiplication table:

×	1	2	3
0	0	0	0
1	1	2	3
2	2	3	1
3	3	1	2

We can write each of the four digits in binary as multiplication will be performed on the binary representation of the values:

Digit	Binary representation
0	00
1	01
2	10
3	11

In GF(4), multiplication is done by multiplying digit by digit without carry. For example:

21302032 × 31012233 = 

21302032
31012233 ×
--------
11003021

We can use the binary representation of the values and perform the multiplication:

21302032 = 1001110010001110 in binary
31012233 = 1101000110101111 in binary
11003021 = 0101000011001001 in binary

1001110010001110
1101000110101111 ×
----------------
0101000011001001

Lastly, here is an implementation of a working java code that performs the multiplication. However, it uses a for loop and the goal is to come up with constant time algorithm:

public class Multiplication {
    public static void main(String[] args) {
        final byte[][] MUL_ARRAY = new byte[][]{
                {0, 0, 0, 0},
                {0, 1, 2, 3},
                {0, 2, 3, 1},
                {0, 3, 1, 2}
        };
        long mask;
        byte shift = 2;

        //long is 64 bits which means it can store 32 digits quaternary value.
        int  n      = 8;                 //# of quaternary digits (ALWAYS given)
        long v1     = 0b1001110010001110;//21302012 in base 4
        long v2     = 0b1101000110101111;//31012233 in base 4
        long result = 0;

        for (int i = 0; i < n; i++) {
            //get far-right quaternary digit of the two vectors:
            mask = 0b11;
            mask = mask << 2 * (n - i - 1);
            long v1DigitPadded = v1 & mask;//correct digit with zero padding
            long v2DigitPadded = v2 & mask;//correct digit with zero padding
            //shift the digits so that the digit needed is at far-right
            v1DigitPadded = v1DigitPadded >>> 2 * (n - i - 1);
            v2DigitPadded = v2DigitPadded >>> 2 * (n - i - 1);
            //The actual quaternary digit
            byte v1Digit     = (byte) v1DigitPadded;
            byte v2Digit     = (byte) v2DigitPadded;
            byte product     = MUL_ARRAY[v1Digit][v2Digit];
            long resultDigit = product << 2 * (n - i - 1);
            result = result | resultDigit;
        }
        //desired output: 0101000011001001
        //prints the value in binary with zeros padding on the left
        String s      = Long.toBinaryString(result);
        String output = String.format("%" + n * 2 + "s", s).replace(" ", "0");
        System.out.println("The output is: " + output);
    }
}

Is there an algorithm for that? If not, are there some improvements that can help in my logic (maybe an efficient multithreading approach)?

Mark Dickinson · Accepted Answer

The answer to the question "Can a formula (without a for loop) be used to achieve the result?" is yes.

Here's some code. It's in Python, but it should translate directly to Java with no major modifications needed. Examples and explanations follow the code. It's written assuming 64-bit integers encoding 32 elements of GF(4) each - if you want 32-bit integers, use a MASK of 0x5555_5555 instead.

#: Mask to extract the low-order bit of each GF(4) element.
MASK = 0x5555_5555_5555_5555


def gf4_vector_multiply(i, j):
    """
    Vector multiply in GF(4).

    i and j are 64-bit integers, each of which encodes 32 elements
    of GF(4) in pairs of consecutive bits.

    The returned value represents the 32 products of the GF(4) elements,
    encoded as a 64-bit integer in the same way.
    """
    ii = i >> 1
    jj = j >> 1
    r0 = (ii & jj) ^ (i & j)
    r1 = (ii & j) ^ (jj & i) ^ (ii & jj)
    return ((r1 & MASK) << 1) ^ (r0 & MASK)

Example

Here's the above function applied to the example you give, multiplying 21302032 and 31012233 to get 11003021:

>>> i = int("21302032", base=4)
>>> j = int("31012233", base=4)
>>> product = gf4_vector_multiply(i, j)
>>> product == int("11003021", base=4)
True

Explanation

We compute the low bit of each product (r0 in the above code) and the high bit (r1) separately, and then combine them.

For the low bit of the product, we have the following table (copied from the table in the question, but replacing 2s and 3s with 0s and 1s):

×	1	2	3
0	0	0	0
1	1	0	1
2	0	1	1
3	1	1	0

The values in this table are the bitwise exclusive-ors of the corresponding values in the following two tables:

×	1	3
0	0	0
1	1	1
2	0	0
3	1	1

and

×	2	3
0	0	0
1	0	0
2	1	1
3	1	1

The first of these two tables is just the bitwise-and of the low-order bits of the inputs (i & j in the code), while the second is the bitwise-and of the high-order bits of the inputs (ii & jj in the code). So (ii & jj) ^ (i & j) gives us the low-order bit of the result. The high-order bit r1 is constructed similarly.

Note that the interesting parts of r0 and r1 are in the low-order bits of each pair, that is, in r0 & 0x55555555 and r1 & 0x55555555. The high-order bits of each pair (r0 & 0xaaaaaaaa and r1 & 0xaaaaaaaa) aren't useful: their values contain the results of mixing consecutive GF(4) digits, which isn't what we want, so we have to be careful to mask those bits out before reassembling the result.

Optimisation

I haven't made much effort to minimise the number of bitwise operations here; there's probably some scope for reducing the total number of bit operations. The code above uses 14 operations. Here's a variant that's somewhat less clear, but uses 12 operations instead. Note that with normal operator precedence rules (both in Java and Python), all but one of the pairs of parentheses is redundant: I find the code more readable with the parentheses, but feel free to strip them out if you think it looks better that way.

def gf4_vector_multiply(i, j):
    """
    Vector multiply in GF(4).

    i and j are 64-bit integers, each of which encodes 32 elements
    of GF(4) in pairs of bits.

    The returned value represents the 32 products of the GF(4) elements,
    encoded in the same way.
    """
    ii = (i >> 1) & MASK
    jj = (j >> 1) & MASK
    return (((ii & j) ^ (jj & i)) << 1) ^ (ii & jj) ^ (i & j)

Also, as you suggest, if you have an embarrassingly parallel problem then using multithreading may well help, but I think that's really an investigation for a separate Stack Overflow question. On a modern CPU, it may well be that the above code is simple enough that the bottleneck will be getting data in and out of RAM rather than computation.

Special case n = 1

In comments, the question author asked about single-digit multiplication. For a single digit multiply, there are other tricks available. Here's one possibility (again in Python, but again it should translate directly to Java):

def gf4_scalar_multiply(i, j):
    return (0x813e4 >> 2*i*j) & 3

Here we've simply encoded the product values in pairs of bits in the magic constant 0x813e4, which in base 4 is 2001033210. Now we use a shift to look up the appropriate base 4 product given the integer product.

Let's check that this gives the expected multiplication table:

>>> for i in range(4):
...     for j in range(4):
...          print(gf4_scalar_multiply(i, j), end=" ")
...     print()
... 
0 0 0 0 
0 1 2 3 
0 2 3 1 
0 3 1 2

Constant time for multiplication in Galois Field GF(4)

Answers (1)

Example

Explanation

Optimisation

Special case n = 1

Related Questions