How to efficiently repeat a vector to a matrix in cuda?

Question

I want to repeat a vector to form a matrix in CUDA while avoiding excessive memory copies. Both the vector and the matrix are allocated on the GPU.

For example:

I have a vector:

a = [1 2 3 4]

expand it to a matrix:

b = [1 2 3 4;
     1 2 3 4;
     .......
     1 2 3 4]

What I have tried is assigning each element of b individually, but this involves a lot of GPU-memory-to-GPU-memory copies.

I know this is easy in MATLAB (using repmat), but how can I do it efficiently in CUDA? I didn't find any suitable routine in cuBLAS.

Robert Crovella · Accepted Answer

EDIT: Based on the comments, I've updated the code to a version that handles either row-major or column-major underlying storage.

Something like this should be reasonably fast:

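// for row_major, blocks*threads should be a multiple of vlen
// for column_major, blocks should be equal to vlen
template <typename T>
__global__ void expand_kernel(const T* vector, const unsigned vlen, T* matrix, const unsigned mdim, const unsigned col_major=0){
  if (col_major){
    // column-major: block b fills column b, i.e. the mdim contiguous
    // elements starting at offset b*mdim, all with the value vector[b]
    int idx = threadIdx.x+blockIdx.x*mdim;
    T myval = vector[blockIdx.x];
    while (idx < ((blockIdx.x+1)*mdim)){
      matrix[idx] = myval;
      idx += blockDim.x;  // block-stride loop within this column
    }
  }
  else{
    // row-major: grid-stride loop over all mdim*vlen elements; since the
    // total thread count is a multiple of vlen, idx%vlen stays the same
    // for a given thread, so the value is loaded only once
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    T myval = vector[idx%vlen];
    while (idx < mdim*vlen){
      matrix[idx] = myval;
      idx += gridDim.x*blockDim.x;
    }
  }
}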

This assumes your matrix is of dimensions mdim rows x vlen columns (seems to be what you have outlined in the question.)

You can tune the grid and block dimensions to find out what works fastest for your particular GPU. For the row-major case, start with 256 or 512 threads per block, and set the number of blocks equal to or greater than 4 times the number of SMs in your GPU. Choose the product of grid and block dimensions to be equal to an integer multiple of your vector length vlen. If this is difficult, choosing an arbitrary, but "large" threadblock size, such as 250 or 500, should not result in much lost efficiency.
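The row-major sizing rule above (a lower bound on the block count, with blocks times threads an integer multiple of vlen) can be worked out on the host before the launch. The helper below is only a sketch and is not part of the original answer: `pick_row_major_blocks` and its parameter names are hypothetical, and `min_blocks` is assumed to be whatever lower bound you choose, for example 4 times the SM count of your GPU.

```cpp
#include <cassert>

// Greatest common divisor (Euclid), written out so the sketch needs no C++17.
static unsigned gcd_u(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Hypothetical helper (an assumption for illustration, not from the answer).
// Returns the smallest block count >= min_blocks for which
// blocks * threads_per_block is an integer multiple of vlen.
unsigned pick_row_major_blocks(unsigned vlen, unsigned threads_per_block,
                               unsigned min_blocks) {
    assert(vlen > 0 && threads_per_block > 0);
    // Any valid block count must be a multiple of this step.
    unsigned step = vlen / gcd_u(vlen, threads_per_block);
    // Round min_blocks up to the next multiple of step (at least one step).
    unsigned multiples = (min_blocks + step - 1) / step;
    if (multiples == 0) multiples = 1;
    return multiples * step;
}
```

With vlen = 512, 512 threads per block and a lower bound of 44 blocks, this returns 44. If vlen shares few factors with the block size, the result grows toward a multiple of vlen itself, and the "arbitrary but large" block size fallback described above may be the simpler choice.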

For the column-major case, choose 256 or 512 threads per block, and choose the number of blocks equal to vlen, the vector length. If vlen > 65535, you will need to compile this for compute capability 3.0 or higher. If vlen is small, perhaps less than 32, the efficiency of this method may be significantly reduced. Some mitigation will be found if you increase the threads per block to the maximum for your GPU, either 512 or 1024. There may be other "expand" realizations that may be better suited to the column-major "narrow" matrix case. For example, a straightforward modification to the column-major code would allow two blocks per vector element, or four blocks per vector element, and the total launched blocks would then be 2*vlen or 4*vlen, for example.
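The "two or four blocks per vector element" variant mentioned above comes down to a change in index arithmetic. The following is a host-side emulation of that arithmetic, not a tested kernel: the function name `expand_col_major_split` and the parameters `split` and `block_dim` are hypothetical, and the outer two loops merely stand in for `blockIdx.x` and `threadIdx.x`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation (illustrative assumption, not the answer's code) of a
// column-major expand in which `split` blocks cooperate on each column.
template <typename T>
std::vector<T> expand_col_major_split(const std::vector<T>& vec, unsigned mdim,
                                      unsigned split, unsigned block_dim) {
    assert(split > 0 && block_dim > 0);
    const unsigned vlen = static_cast<unsigned>(vec.size());
    std::vector<T> matrix(static_cast<std::size_t>(vlen) * mdim);
    const unsigned grid_dim = split * vlen;             // total launched blocks
    const unsigned chunk = (mdim + split - 1) / split;  // rows per block, rounded up

    for (unsigned block = 0; block < grid_dim; ++block) {  // plays blockIdx.x
        const unsigned col = block / split;   // vector element this block copies
        const unsigned part = block % split;  // which slice of the column it owns
        const std::size_t col_end = static_cast<std::size_t>(col + 1) * mdim;
        const std::size_t begin =
            static_cast<std::size_t>(col) * mdim + static_cast<std::size_t>(part) * chunk;
        std::size_t end = begin + chunk;
        if (end > col_end) end = col_end;     // last slice may be shorter (or empty)

        for (unsigned thread = 0; thread < block_dim; ++thread)  // plays threadIdx.x
            for (std::size_t idx = begin + thread; idx < end; idx += block_dim)
                matrix[idx] = vec[col];       // same block-stride loop as the kernel
    }
    return matrix;
}
```

In a real kernel, `col`, `part`, `begin` and `end` would be computed from `blockIdx.x` in the same way, and the launch would use `split*vlen` blocks. Whether this actually helps for small vlen would need to be measured on the target GPU.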

Here's a fully worked example, along with a run of the bandwidthTest sample, demonstrating that the above code achieves ~90% of the device-to-device throughput reported by bandwidthTest:

$ cat t546.cu
#include <stdio.h>

#define W 512
#define H (512*1024)
// for row_major, blocks*threads should be a multiple of vlen
// for column_major, blocks should be equal to vlen
template <typename T>
__global__ void expand_kernel(const T* vector, const unsigned vlen, T* matrix, const unsigned mdim, const unsigned col_major=0){
  if (col_major){
    int idx = threadIdx.x+blockIdx.x*mdim;
    T myval = vector[blockIdx.x];
    while (idx < ((blockIdx.x+1)*mdim)){
      matrix[idx] = myval;
      idx += blockDim.x;
      }
    }
  else{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    T myval = vector[idx%vlen];
    while (idx < mdim*vlen){
      matrix[idx] = myval;
      idx += gridDim.x*blockDim.x;
      }
    }
}

template <typename T>
__global__ void check_kernel(const T* vector, const unsigned vlen, T* matrix, const unsigned mdim, const unsigned col_major=0){
  unsigned i = 0;
  while (i<(vlen*mdim)){
    unsigned idx = (col_major)?(i/mdim):(i%vlen);
    if (matrix[i] != vector[idx]) {printf("mismatch at offset %d\n",i); return;}
    i++;}
}

int main(){

  int *v, *m;
  cudaMalloc(&v, W*sizeof(int));
  cudaMalloc(&m, W*H*sizeof(int));
  int *h_v = (int *)malloc(W*sizeof(int));
  for (int i = 0; i < W; i++)
    h_v[i] = i;
  cudaMemcpy(v, h_v, W*sizeof(int), cudaMemcpyHostToDevice);

  // test row-major

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  expand_kernel<<<44, W>>>(v, W, m, H);
  cudaEventRecord(stop);
  float et;
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  printf("row-majortime: %fms, bandwidth: %.0fMB/s\n", et, W*H*sizeof(int)/(1024*et));
  check_kernel<<<1,1>>>(v, W, m, H);
  cudaDeviceSynchronize();
  // test col-major

  cudaEventRecord(start);
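  // col-major: one block per vector element (W blocks); 256 threads per block is one of the sizes suggested above
  expand_kernel<<<W, 256>>>(v, W, m, H, 1);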
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  printf("col-majortime: %fms, bandwidth: %.0fMB/s\n", et, W*H*sizeof(int)/(1024*et));
  check_kernel<<<1,1>>>(v, W, m, H, 1);
  cudaDeviceSynchronize();
  return 0;
}

$ nvcc -arch=sm_20 -o t546 t546.cu
$ ./t546
row-majortime: 13.066944ms, bandwidth: 80246MB/s
col-majortime: 12.806720ms, bandwidth: 81877MB/s
$ /usr/local/cuda/samples/bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Quadro 5000
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     5864.2

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6333.1

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     88178.6

Result = PASS
$

CUDA 6.5, RHEL 5.5

This can also be implemented using a CUBLAS Rank-1 update function but it will be considerably slower than the above method.
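To make the rank-1 formulation concrete: a rank-1 update computes A := alpha * x * y^T + A, so repeating the vector means starting from a zeroed matrix, using a vector of ones for x and the source vector for y. The sketch below is a plain host-side reference of that arithmetic in column-major storage, not a cuBLAS call; `ger_reference` and `repeat_via_rank1` are hypothetical names, and the assumption is that the routine meant above belongs to the GER family (for example `cublasSger` in single precision).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference rank-1 update on a column-major m x n matrix:
//   A[i + j*m] += alpha * x[i] * y[j]
// (hypothetical helper for illustration; not a cuBLAS function)
void ger_reference(unsigned m, unsigned n, float alpha,
                   const std::vector<float>& x, const std::vector<float>& y,
                   std::vector<float>& A) {
    assert(x.size() == m && y.size() == n);
    assert(A.size() == static_cast<std::size_t>(m) * n);
    for (unsigned j = 0; j < n; ++j)
        for (unsigned i = 0; i < m; ++i)
            A[i + static_cast<std::size_t>(j) * m] += alpha * x[i] * y[j];
}

// Repeat `a` into an mdim x a.size() column-major matrix via a rank-1 update.
std::vector<float> repeat_via_rank1(const std::vector<float>& a, unsigned mdim) {
    const unsigned vlen = static_cast<unsigned>(a.size());
    std::vector<float> ones(mdim, 1.0f);                             // x = all ones
    std::vector<float> B(static_cast<std::size_t>(mdim) * vlen, 0.0f);  // must start at zero
    ger_reference(mdim, vlen, 1.0f, ones, a, B);                     // B = ones * a^T
    return B;
}
```

Written this way, the extra work is visible: the matrix has to be zeroed first, a ones vector has to exist, and every element of the matrix is read as well as written. That is plausibly part of why this route is slower than the write-only kernel above.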
