Reputation: 109
Hi I am not so familiar with gpu and I Just have a theoretical question.
So I am working on an application called Sassena, which calculates Neutron scattering from Molecular dynamics trajectories. This application is written in parallel with MPI and works for CPUs very well. But I am willing to run this app over GPU to make it faster. ofcourse not all of it but partly. when I look at the Source Code, The way it works is typical MPI, meaning the first rank send the data to each node individually and then each nodes does the calculation. Now, there is a part of calculation which is using Fast Fourier Transform(FFT), which consumes the most time and I want to send this part to GPU.
I see 2 Solutions ahead of me:
when the nodes reach the FFT part, they should send back the data to the main node, and when the main node gathered all the data it sends them to GPU, then GPU does the FFT, sends it back to cpu and cpu does the rest.
Each node would dynamically send the data to GPU and after the GPU does the FFT, it sends back to each node and they do the rest of their job.
So my Question is which one of these two are possible. I know first one is doable but it is having a lot of communication which is time consuming. But the second way I don't know if it is possible at all or not. I know in the second case it will be dependent on the Computer architecture as well. But is CUDA or OpenCL capable of this at all??
Thanks for any idea.
Upvotes: 3
Views: 144
Reputation: 1085
To my knowledge you are not restricted by CUDA. What you are restricted here is the number of GPUs you have. You need to create some sort of queue that distributes your work to the available GPUs and keeps track of free resources. Depending on ratio between the number of CPUs to the number of GPUs and the amount of time each FFT takes, you may be waiting longer for each FFT to be passed to the GPU compared to just doing it on each core.
What I mean is that you lose the asynchronous computation of FFT which is performed on each core. Rather, CPU 2
have to wait for CPU 1
to finish its FFT computation to be able to initiate a new kernel on GPU.
Other than what I have said, it is possible to create a simple mutex which is locked when a CPU starts computing its FFT and is unlocked when it finishes so that the next CPU can use the GPU.
You can look at StarPU. It is a task based api which can handle sending tasks to GPUs. It is also designed for distributed memory models.
Upvotes: 1