user1999728

Reputation: 903

Best approach for convolution of multiple small matrices using CUDA

I need to perform multiple convolutions with small matrices and kernels, and I was hoping that utilizing the many processors of the GPU would enable me to do it as fast as possible.

The problem is as follows: I have many matrices (~1,000 to ~10,000) of relatively small sizes (~15x15 down to 1x1, i.e. a scalar), and a certain number of convolution masks (~20 down to 1). I need to convolve all the matrices with each convolution mask. For example:

A; % 5,000 matrices of size 10x10, A(i) = a 10x10 matrix
B; % 10 matrices of size 5x5, B(j) = a 5x5 matrix
res(j) = conv(A, B(j)); % res(j) is the result of convolving all 5,000
                        % matrices in A with the j'th kernel B(j)

The goal is to compute res(1), ..., res(10) as quickly as possible.
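To pin down exactly what I mean, here is a naive CPU reference of the operation (just to make the semantics unambiguous, not how I intend to implement it; I'm assuming a "same"-size output with zero padding at the borders):

/* Naive reference: convolve each of the numMatrices n-by-n matrices in A
   with the single k-by-k mask B, writing "same"-size outputs into res.
   A and res store the matrices back to back in row-major order. */
void conv_batch_reference(const float *A, const float *B, float *res,
                          int numMatrices, int n, int k)
{
    int h = k / 2; /* mask half-width */
    for (int m = 0; m < numMatrices; ++m)
        for (int r = 0; r < n; ++r)
            for (int c = 0; c < n; ++c) {
                float acc = 0.0f;
                for (int i = 0; i < k; ++i)
                    for (int j = 0; j < k; ++j) {
                        int rr = r + i - h, cc = c + j - h;
                        if (rr >= 0 && rr < n && cc >= 0 && cc < n)
                            acc += A[m*n*n + rr*n + cc]
                                 * B[(k-1-i)*k + (k-1-j)]; /* flipped mask */
                    }
                res[m*n*n + r*n + c] = acc;
            }
}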

I would like to hear suggestions about how to implement the most efficient algorithm. FFT-based convolution would probably be too slow.

Every implementation I've seen so far is for 2D convolution of two large matrices, while I need to convolve many small matrices.

I know very little about CUDA programming right now, but I'm in the process of learning.

I was hoping to figure this out myself, but due to time constraints I have to ask for whatever advice anyone with experience can give me while I learn how to code in CUDA.

Thank you!

P.S. Any pointers to an implementation that suits my purposes are more than appreciated. I am a university student, and this is for a small research project, so nothing I need to pay for, please...

Upvotes: 1

Views: 1458

Answers (2)

arrayfire

Reputation: 1754

One approach is to use the GFOR loop from ArrayFire, a library I work on.

You can tile as many small convolutions into one big kernel launch as you want, as long as you don't run out of GPU memory, as follows:

int m = 10;              // number of convolutions to batch together
array x = randu(5);      // the input
array y = randu(5,m);    // the output (5 x m: one column per convolution)
array f = constant(1,3); // the kernel
gfor (array k, 0, m-1) {
    y(span,k) = convolve(x,f);
}
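A rough, untested sketch of how the same pattern might map onto your 2D case (this assumes the 5,000 inputs are stored as slices of a 10x10x5000 array and that a 2D convolution routine such as convolve2 is available in your ArrayFire version; exact names may differ between releases):

array A   = randu(10, 10, 5000);       // 5,000 input matrices, one per slice
array B1  = randu(5, 5);               // one 5x5 mask
array res = constant(0, 10, 10, 5000); // output, same layout as A
gfor (array k, 0, 4999) {
    res(span, span, k) = convolve2(A(span, span, k), B1);
}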

Good luck!

Upvotes: 0

Vitality

Reputation: 21475

I do not pretend to give an ultimate answer to your question, but I would just like to point out a couple of things:

  1. As you mentioned, a first possibility would be to use the FFT approach. One issue here (correct me if I'm wrong) is that the cuFFT library is primarily designed to cope with large matrices, so to benefit from this approach you would need FFT routines that are efficient for small matrices. Algorithms of this kind do exist; see for example the paper Small Discrete Fourier Transforms on GPUs. I have no direct experience with the performance of CUDA FFTs on small matrices of the indicated size, but the approach could still be interesting for you, since the mask matrices are few (10) and you can therefore "recycle" their FFTs across a large number of convolutions (5,000); a rough sketch of this idea is given after this list.
  2. If you decide not to use the FFT approach, and if you have a GPU with compute capability >= 3.5, then dynamic parallelism could be a good candidate for calculating the convolutions. If you regard the evaluation of each convolution output element as an interpolation, you will have interpolation problems of size 15x15, and dynamic parallelism could help; see the post: Benefit of splitting a big CUDA kernel and using dynamic parallelism, and the sketch at the end of this answer.
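Regarding point 1, here is only a rough, untested sketch of the "recycling" idea built on cuFFT's batched plans (the 16x16 padding, the sizes and the function names are my own illustrative choices; the mask spectrum d_maskFFT is assumed to have been computed once beforehand on the same padded grid, and error checking is omitted):

#include <cufft.h>
#include <cuda_runtime.h>

#define NX 16     // padded rows (>= 10 + 5 - 1)
#define NY 16     // padded cols
#define NMAT 5000 // number of input matrices

// Pointwise complex multiply of every padded input spectrum by the single,
// precomputed mask spectrum, with the 1/(NX*NY) inverse-FFT scaling folded in.
__global__ void pointwiseMulScale(cufftComplex* data, const cufftComplex* maskFFT,
                                  int perMatrix, int total, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= total) return;
    cufftComplex a = data[i];
    cufftComplex b = maskFFT[i % perMatrix]; // reuse the same mask spectrum
    cufftComplex c;
    c.x = (a.x * b.x - a.y * b.y) * scale;
    c.y = (a.x * b.y + a.y * b.x) * scale;
    data[i] = c;
}

// Convolve all NMAT zero-padded inputs (already on the device) with one mask.
void convolveAllWithOneMask(cufftComplex* d_inputs, const cufftComplex* d_maskFFT)
{
    cufftHandle plan;
    int n[2] = { NX, NY };
    // One batched plan covers all 5,000 forward (and inverse) transforms.
    cufftPlanMany(&plan, 2, n, NULL, 1, NX * NY, NULL, 1, NX * NY,
                  CUFFT_C2C, NMAT);

    cufftExecC2C(plan, d_inputs, d_inputs, CUFFT_FORWARD);

    const int total = NX * NY * NMAT;
    const int threads = 256;
    pointwiseMulScale<<<(total + threads - 1) / threads, threads>>>(
        d_inputs, d_maskFFT, NX * NY, total, 1.0f / (NX * NY));

    cufftExecC2C(plan, d_inputs, d_inputs, CUFFT_INVERSE);
    cufftDestroy(plan);
}

Calling this once per mask (on a fresh copy of the padded inputs) reuses each mask's FFT over all 5,000 inputs; a further optimization would be to compute the forward FFTs of the inputs only once and reuse them across all 10 masks.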

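Regarding point 2, a very rough, untested sketch of the dynamic parallelism idea (compute capability >= 3.5 and compilation with -rdc=true are required; the kernel names and the data layout are illustrative only):

// Child kernel: one thread per output element of a single n x n "same"-size
// convolution of one input matrix with the k x k mask B.
__global__ void convOneChild(const float* A, const float* B, float* out,
                             int n, int k)
{
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= n || c >= n) return;
    float acc = 0.0f;
    int h = k / 2;
    for (int i = 0; i < k; ++i)
        for (int j = 0; j < k; ++j) {
            int rr = r + i - h, cc = c + j - h;
            if (rr >= 0 && rr < n && cc >= 0 && cc < n)
                acc += A[rr * n + cc] * B[(k - 1 - i) * k + (k - 1 - j)];
        }
    out[r * n + c] = acc;
}

// Parent kernel: one thread per input matrix; each thread launches a small
// child grid for "its" matrix. For thousands of child launches the device
// runtime's pending-launch limit may need raising via cudaDeviceSetLimit.
__global__ void convAllParent(const float* A, const float* B, float* out,
                              int n, int k, int numMatrices)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= numMatrices) return;
    dim3 block(16, 16);
    dim3 grid((n + 15) / 16, (n + 15) / 16);
    convOneChild<<<grid, block>>>(A + m * n * n, B, out + m * n * n, n, k);
}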
Upvotes: 2
