Nils Pipenbrinck
Nils Pipenbrinck

Reputation: 86363

Optimize y = x*x in Galois field arithmetic

I have this C-code to do multiplications over GF(8):

int32_t GaloisMultiply (int32_t a, int32_t b) 
{
    int32_t i;
    int32_t mask = 0x100;
    int32_t y = 0;

    for(i=0;i<8;i++) 
    {
        if(b & mask) 
        {
            y ^= a;
        }
        mask >>= 1;
        y <<= 1;
    }

    if(b & 0x1) 
    {
        y ^= a;
    }

    return(y);
}

That's more or less the text-book implementation.

I wonder if I there is a clever optimization for above algorithm if I can assert that a is always b, e.g. I do squaring instead of multiplication. I'm not after a cryptographic use btw. I just want to make use of the fact that x*x in GF(8) interleaves the bits of x with zero bits one by one.

There are already quite clever methods to do the bit interleaving, but since I've found out that x*x in GF(8) does the bit interleaving thing (by accident) I can't stop trying to use it for bit-interleaving optimizations.

Any ideas?

Upvotes: 4

Views: 1256

Answers (7)

Aki Suihkonen
Aki Suihkonen

Reputation: 20037

The square x*x mod r(x) is simply the superposition of all the weights of the bits i in x with w_i = x^{i*2} mod r(x). This can be evaluated with 3 64-bit lookup-tables in parallel.

  i0to2 = x & 0b111;
  i3to5 = x & 0b111000;
  i6to7 = x >> 6;
  return (k0to2 >> (i0to2 * 8)) ^ 
         (k3to5 >> i3to5) ^
         (k6to7 >> (i6to7 * 8);

This can be reduced to two 3-bit tables, and additional logic knowing the weight for bit 1 == 1 -- that is result ^= (x & 1). Likewise the weigth for bit 4 == r(x) - x^8 leaving just two 64-bit tables k1to3, k5to7. Here the encoded tables fit 64-bit register with eg the byte 3 (0b11) in k0to2 would correspond to w_0 ^ w_1 == 5 (for all polynomials r(x)).

Upvotes: 1

Phil
Phil

Reputation:

Lookup table is definitely the fastest for polynomial basis galois squaring. It is also the fastest for multiplication when using GF(8), but the tables get too large for larger fields as used in ECC. For multiplication in larger fields, the best algorithm is the 'left to right combine' method...(see http://www.amazon.com/Elliptic-Cryptography-Springer-Professional-Computing/dp/038795273X algorithm 2.36, page 50).

Upvotes: 1

who
who

Reputation:

int32_t GaloisMultiply( int32_t a ) 
{
  int32_t y = 0;
  int32_t b = a & 0x01ff;

  while ( b ) 
  {
    if ( b & 1 ) 
      y ^= a;

    a <<= 1;
    b >>= 1;
  }
  return y;
}

Or if you like:

int32_t GaloisMultiply( int32_t a ) 
{
  int32_t y = 0;
  for ( int32_t b = a & 0x01ff; b; b >>= 1 )
  {
    if ( b & 1 ) 
      y ^= a;

    a <<= 1;
  }
  return y;
}

The reason that this approach is more efficient than the original code above is primarily because the loop is only performed until all the 'interesting' bits in the argument are consumed as opposed to blindly checking all (9) bits.

A table based approach will be faster though.

Upvotes: 1

Cade Roux
Cade Roux

Reputation: 89671

Table-based? link

And when you are limited to x*x, it's a sparse matrix.

Here's another good paper (and a library)

Upvotes: 4

rlerallut
rlerallut

Reputation: 7827

It might help the compiler a bit to mark "a" and "b" as const. Or unrolling the loop by hand. It would be sad if it helped, though...

Isn't it a patent minefield, by the way ?

Upvotes: 0

AShelly
AShelly

Reputation: 35550

This is probably not what you are looking for, but here's one minor speedup:

Pass only one argument, if they are guaranteed to be the same.

Upvotes: 0

Adam Rosenfield
Adam Rosenfield

Reputation: 400364

You could probably write some assembly to do a slightly better job. However, I'd be pretty surprised if this was the bottleneck in your application; have you done any profiling? This function doesn't seem like it's worth optimizing.

Upvotes: 0

Related Questions