Reputation: 119
I'm currently focusing on the implementation of Kyber (ML-KEM).
I noticed that the AVX2 version of the compress operation seems to use a fast division algorithm.
In the compress operation, we need to calculate x * (2**10 / Q)
, where x is some int16_t
input (a poly coefficient in Q prime field), and Q is a prime number (3329 in Kyber, and 7681 in the following code I pasted). The final result should be rounded to its closer bound, say if the results are from 2.1~2.499999..., then they are rounded down to 2 while larger than 2.5 are rounded up to 3.
ref version code here: https://github.com/pq-crystals/kyber/blob/main/ref/polyvec.c#L51
AVX2 version code here: https://github.com/pq-crystals/kyber/blob/main/avx2/polyvec.c#L11
I am confused about how the AVX2 version works, and what is the algorithm behind that.
Besides that, there is another implementation of the compress10
method:
// input "a" is a polyvec
//Here q is 7681, not same as current Kyber (ML-KEM)
const __m256i fdiv = _mm256_set1_epi16(8737);//fast division // floor(2^26 / q + 0.5)
const __m256i hfq = _mm256_set1_epi16(15);
for(int i = 0; i < K; i++) {
for(int j = 0; j < 256; j++) {
d0.vec = _mm256_loadu_si256((__m256i *)&a->vec[i].coeffs[16 * j]);
d0.vec = _mm256_slli_epi16(d0.vec, 2);
d0.vec = _mm256_add_epi16(d0.vec, hfq);
d0.vec = _mm256_mulhi_epi16(d0.vec, fdiv);
d0.vec = _mm256_srli_epi16(d0.vec, 2);
......
}
}
I summarized the formula above as: (((x*4 + 15) * 8737) >> 16) >> 2
.
Its ref version is like:
....
uint16_t t[4];
for (i = 0; i< K; i++)
{
for (j = 0; j< 256/ 4; j++)
{
for (k = 0; k < 4; k++) {
// PARAM_Q here is 7681
t[k] = ((((uint32_t)a->vec[i].coeffs[4 * j + k] << 10) + PARAM_Q / 2) / PARAM_Q) & 0x3ff;
}
// other loop operations
....
I summarized the formula above as: ((x << 10) + q/2) / q
.
The ref version code is quite simple and easy to understand, but the optimized one is really confusing.
And I found out that when the poly coefficient is 5772
, the ref version will get the compressed result as 769
, while the AVX2 version generates 770
. I ran 20 groups of test vectors for AVX2 version and ref version (code I mentioned above, Kyber's both versions are correct all the time), this is the only case that shows different results between the two versions. It seems like a precision issue.
I don't know why and also don't know how the optimized algorithm (both the Kyber version and another version I mentioned above) works (or does not work).
Upvotes: 4
Views: 134