Reputation: 86363
I have this C-code to do multiplications over GF(8):
int32_t GaloisMultiply (int32_t a, int32_t b)
{
int32_t i;
int32_t mask = 0x100;
int32_t y = 0;
for(i=0;i<8;i++)
{
if(b & mask)
{
y ^= a;
}
mask >>= 1;
y <<= 1;
}
if(b & 0x1)
{
y ^= a;
}
return(y);
}
That's more or less the text-book implementation.
I wonder if I there is a clever optimization for above algorithm if I can assert that a is always b, e.g. I do squaring instead of multiplication. I'm not after a cryptographic use btw. I just want to make use of the fact that x*x in GF(8) interleaves the bits of x with zero bits one by one.
There are already quite clever methods to do the bit interleaving, but since I've found out that x*x in GF(8) does the bit interleaving thing (by accident) I can't stop trying to use it for bit-interleaving optimizations.
Any ideas?
Upvotes: 4
Views: 1256
Reputation: 20037
The square x*x mod r(x)
is simply the superposition of all the weights of the bits i in x with w_i = x^{i*2} mod r(x)
.
This can be evaluated with 3 64-bit lookup-tables in parallel.
i0to2 = x & 0b111;
i3to5 = x & 0b111000;
i6to7 = x >> 6;
return (k0to2 >> (i0to2 * 8)) ^
(k3to5 >> i3to5) ^
(k6to7 >> (i6to7 * 8);
This can be reduced to two 3-bit tables, and additional logic knowing the weight for bit 1 == 1 -- that is result ^= (x & 1)
.
Likewise the weigth for bit 4 == r(x) - x^8 leaving just two 64-bit tables k1to3, k5to7
.
Here the encoded tables fit 64-bit register with eg the byte 3 (0b11) in k0to2 would correspond to w_0 ^ w_1 == 5 (for all polynomials r(x)).
Upvotes: 1
Reputation:
Lookup table is definitely the fastest for polynomial basis galois squaring. It is also the fastest for multiplication when using GF(8), but the tables get too large for larger fields as used in ECC. For multiplication in larger fields, the best algorithm is the 'left to right combine' method...(see http://www.amazon.com/Elliptic-Cryptography-Springer-Professional-Computing/dp/038795273X algorithm 2.36, page 50).
Upvotes: 1
Reputation:
int32_t GaloisMultiply( int32_t a )
{
int32_t y = 0;
int32_t b = a & 0x01ff;
while ( b )
{
if ( b & 1 )
y ^= a;
a <<= 1;
b >>= 1;
}
return y;
}
Or if you like:
int32_t GaloisMultiply( int32_t a )
{
int32_t y = 0;
for ( int32_t b = a & 0x01ff; b; b >>= 1 )
{
if ( b & 1 )
y ^= a;
a <<= 1;
}
return y;
}
The reason that this approach is more efficient than the original code above is primarily because the loop is only performed until all the 'interesting' bits in the argument are consumed as opposed to blindly checking all (9) bits.
A table based approach will be faster though.
Upvotes: 1
Reputation: 89671
Table-based? link
And when you are limited to x*x, it's a sparse matrix.
Here's another good paper (and a library)
Upvotes: 4
Reputation: 7827
It might help the compiler a bit to mark "a" and "b" as const. Or unrolling the loop by hand. It would be sad if it helped, though...
Isn't it a patent minefield, by the way ?
Upvotes: 0
Reputation: 35550
This is probably not what you are looking for, but here's one minor speedup:
Pass only one argument, if they are guaranteed to be the same.
Upvotes: 0
Reputation: 400364
You could probably write some assembly to do a slightly better job. However, I'd be pretty surprised if this was the bottleneck in your application; have you done any profiling? This function doesn't seem like it's worth optimizing.
Upvotes: 0