It's my first time asking a question here, so sorry if something is wrong.
I'm trying to parallelize the NTT using C++ threads, but I'm simply lost. I based the code on an article explaining a CUDA parallelization of the NTT and adapted it so it makes more sense for a CPU (fewer threads), but I've hit a wall and can't progress.
Basically, I made a class that maps each thread to the pairs of elements in the array it needs to run the butterfly on. The results are wrong, yet the psis seem to be computed correctly (bit-reversed order).
I'm new to parallel computing and to the NTT, so any help is appreciated.
class threadInfo {
public:
    // Pointers to the psi, u and v of every butterfly assigned to one thread.
    std::vector<long long *> psi, u, v;
    void clear() {
        psi.clear();
        u.clear();
        v.clear();
    }
};
void threadButterfly(threadInfo info, long long mod, bool invert) {
    for (size_t i = 0; i < info.u.size(); i++)
    {
        long long u = *info.u[i], v = *info.v[i];
        if (!invert) {
            // forward (Cooley-Tukey) butterfly: multiply v by psi before add/subtract
            v = modulo(v * *info.psi[i], mod);
            *info.u[i] = modulo(u + v, mod);
            *info.v[i] = modulo(u - v, mod);
        } else {
            // inverse (Gentleman-Sande) butterfly: multiply (u - v) by psi after the subtraction
            *info.u[i] = modulo(u + v, mod);
            *info.v[i] = modulo((u - v) * *info.psi[i], mod);
        }
    }
}
void threadSched(vector<long long> &a, long long mod, long long len, std::vector<long long> &psi, vector<threadInfo> info, bool invert) {
    vector<thread> threads(THREAD_NUM);
    long long n = a.size();
    long long step = (n / len) / 2;                              // distance between u and v
    for (long long id = 0; id < n >> 1; id++)                    // assign each (u, v) pair to a thread object
    {
        long long psi_step = id / step;                          // which k in psi**k to use, relative to the first in the group
        long long target = (psi_step * step * 2) + (id % step);  // index of u; v sits at target + step
        long long group = len + psi_step;                        // which k in psi**k to use, relative to all psis
        long long arrayid = (2 * id * THREAD_NUM) / n;           // which thread this pair goes to
        info[arrayid].psi.push_back(&psi[group]);
        info[arrayid].u.push_back(&a[target]);
        info[arrayid].v.push_back(&a[target + step]);
    }
    for (size_t id = 0; id < THREAD_NUM; id++) threads[id] = thread(threadButterfly, info[id], mod, invert);
    for (size_t id = 0; id < THREAD_NUM; id++) threads[id].join();
    for (size_t i = 0; i < info.size(); i++) info[i].clear();
    if (invert)                                                  // scale by n^-1 at the end of the inverse transform
        for (long long j = 0; j < n; j++) a[j] = modulo(a[j] * mod_in(n, mod), mod);
}
void ntt(vector<long long> &a, long long mod, vector<long long> &psi) {
    vector<threadInfo> fwd(THREAD_NUM);
    for (long long len = 1; len < a.size(); len = 2 * len)
    {
        threadSched(a, mod, len, psi, fwd, false);
    }
}
void intt(vector<long long> &a, long long mod, vector<long long> &psi) {
    vector<threadInfo> rev(THREAD_NUM);
    for (long long len = 1; len < a.size(); len = 2 * len)
    {
        threadSched(a, mod, len, psi, rev, true);
    }
}
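For completeness, modulo and mod_in are not shown above; roughly, modulo reduces into [0, mod) even for negative inputs, power is square-and-multiply exponentiation, and mod_in is the inverse via Fermat's little theorem (so this assumes the modulus is prime, which NTT-friendly primes are):
long long modulo(long long x, long long mod) {
    long long r = x % mod;
    return r < 0 ? r + mod : r;
}
long long power(long long base, long long exp, long long mod) {
    long long result = 1;
    base = modulo(base, mod);
    while (exp > 0) {
        if (exp & 1) result = modulo(result * base, mod);
        base = modulo(base * base, mod);
        exp >>= 1;
    }
    return result;
}
long long mod_in(long long x, long long mod) {
    return power(x, mod - 2, mod); // x^(mod-2) = x^-1 when mod is prime
}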
Thanks for your attention.
EDIT1:
Explaining a bit better: the Number Theoretic Transform is basically a Fourier transform over integers in a finite field bounded by a modulus. It changes the representation of a polynomial (e.g. 6x^2 + 5x + 9) from coefficient form, where the index is the power of x ([9, 5, 6]), to point-value form ([9, 20, 26]), so that we can multiply polynomials a bit faster. To do that we need a primitive root of unity (the psis in the code) and its powers, which depend on the size of the polynomial and on the modulus.
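For reference, the psi table is roughly built like this: psi[i] holds psi_root^j where j is the bit-reversal of i, so psi[len + group] in threadSched picks up the right twiddle factor (psi_root is assumed here to be a primitive 2n-th root of unity modulo mod, n a power of two, and power is the modular exponentiation sketched earlier):
std::vector<long long> buildPsiTable(long long psi_root, long long n, long long mod) {
    std::vector<long long> psi(n);
    int logn = 0;
    while ((1LL << logn) < n) logn++;               // n = 2^logn
    for (long long i = 0; i < n; i++) {
        long long rev = 0;
        for (int b = 0; b < logn; b++)              // bit-reverse the index i
            if (i & (1LL << b)) rev |= 1LL << (logn - 1 - b);
        psi[rev] = power(psi_root, i, mod);         // psi_root^i stored at the reversed index
    }
    return psi;
}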
Inside the NTT is a butterfly operation; the paper distributes the array elements inside this operation over the GPU threads, so to adapt it to CPUs I created a class representing each thread and handed SIZEOFPOLYNOMIAL/THREADNUMBER elements to each one. When a thread is started, it computes the butterfly for its own list of elements.
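For comparison, here is a single-threaded sketch of the same forward pass, using exactly the same step/group indexing as threadSched; this is the kind of reference I check the threaded version against (it assumes psi is in bit-reversed order and that modulo handles negative inputs, as above):
void nttSerial(std::vector<long long> &a, long long mod, std::vector<long long> &psi) {
    long long n = a.size();
    for (long long len = 1; len < n; len = 2 * len) {
        long long step = (n / len) / 2;              // distance between u and v
        for (long long group = 0; group < len; group++) {
            long long s = psi[len + group];          // twiddle factor for this group
            long long start = 2 * group * step;
            for (long long j = start; j < start + step; j++) {
                long long u = a[j];
                long long v = modulo(a[j + step] * s, mod);
                a[j]        = modulo(u + v, mod);
                a[j + step] = modulo(u - v, mod);
            }
        }
    }
}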
I'm compiling with gcc -O3 -std=c++17 -pthread -march=native -lm. I'm testing the NTT against several other implementations and this is the only one that isn't working, so the main function (which is enormous) isn't relevant here and would only bloat the post.
Paper: https://eprint.iacr.org/2021/124.pdf