alae

Reputation: 140

Parallel implementation of the computation of the sum of large contiguous subsequences in a large array using Cuda

I have a large array of 1000 elements, and I want to compute the sums of all contiguous subsequences of size 100 in this array using CUDA. Here is an illustrative example with a small size: the array has 20 elements and the sequence length is 5.

tab = [80,12,14,5,70,9,26,30,8,12,16,15,60,12,38,32,17,67,19,11]

The sequences are as follows:

S1= 80+12+14+5+70
S2= 12+14+5+70+9
S3= 14+5+70+9+26
....

Do you have an efficient way to parallelize this task using CUDA with an array of 1000 elements and a sequence length of 100?

Upvotes: 0

Views: 305

Answers (2)

Robert Crovella

Reputation: 152289

Some preface comments:

  1. As already mentioned in the comments, a total array size of 1000 is a very small problem to run on a GPU. If this is the only thing you are doing on the GPU, you're probably not going to get interesting speed up from GPU code.
  2. The previous answer suggested a 1D stencil. Even for a stencil width of 100 (or 200), it should still be possible to use a 1D stencil method. The only real limiting factor on stencil size here is the size of shared memory, and shared memory can easily hold a data size equal to the number of threads per block plus 200 for the stencil (sequence) "halo" (a minimal sketch of that approach is shown below). But that topic has already been covered there, so I'll present another method which should be able to handle arbitrary array lengths and arbitrary sequence sizes.
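
For reference, here is a minimal sketch of that shared-memory stencil approach. The kernel name, block size, and launch parameters are my own illustrative choices, not taken from the linked answer. Each block stages its tile of the input plus a (sequence length - 1) element halo in shared memory, and each thread then sums the window starting at its position:

#define BLOCK 256
#define SEQ_LEN 100

// Each block loads BLOCK + SEQ_LEN - 1 input elements into shared memory,
// then each thread sums the SEQ_LEN elements starting at its own position.
__global__ void window_sums(const int *in, int *out, int num_out)
{
  __shared__ int tile[BLOCK + SEQ_LEN - 1];
  int gid = blockIdx.x * blockDim.x + threadIdx.x;
  int in_size = num_out + SEQ_LEN - 1;           // total input length

  if (gid < in_size)
    tile[threadIdx.x] = in[gid];                 // main tile load
  if (threadIdx.x < SEQ_LEN - 1 && gid + BLOCK < in_size)
    tile[threadIdx.x + BLOCK] = in[gid + BLOCK]; // right-hand halo
  __syncthreads();

  if (gid < num_out) {
    int s = 0;
    for (int i = 0; i < SEQ_LEN; i++)
      s += tile[threadIdx.x + i];
    out[gid] = s;
  }
}
// launch with: window_sums<<<(num_out + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, num_out);
// where num_out = array_size - SEQ_LEN + 1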

In this case, it may be appropriate to see whether we can leverage a prefix sum to help us. Fast parallel methods exist for prefix sum, so this makes it attractive to consider. A prefix sum is just an output sequence of numbers representing the sum of all previous numbers in the input sequence. So if I have an array like this:

1, 1, 3, 2, 0, 1

an exclusive prefix sum would be:

0, 1, 2, 5, 7, 7, 8

exclusive here means that the current sum position includes all previous values, but excludes the data item at the current position. For an exclusive prefix sum, note that we can (if we wish) generate "useful" data for a sequence that is 1 longer than the input sequence length.

The prefix sum can be easily adapted to solve the problem you are asking. Suppose for my above sequence, we wanted to compute the subsequences of 3. We could take our prefix sum, and subtract from it the same prefix sum sequence, shifted right by 3 (the sequence length), like so:

  0, 1, 2, 5, 7, 7, 8
-          0, 1, 2, 5, 7, 7, 8
=          5, 6, 5, 3

This sequence (5, 6, 5, 3) would be the desired answer, in this case.
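
To make the idea concrete, here is a minimal host-side C++ sketch of the same computation (variable names are my own); it is also handy for validating a GPU result:

#include <vector>
#include <iostream>

int main(){
  std::vector<int> in = {1, 1, 3, 2, 0, 1};
  const size_t L = 3;                          // subsequence length
  // exclusive prefix sum, one element longer than the input
  std::vector<int> P(in.size() + 1, 0);
  for (size_t i = 0; i < in.size(); i++) P[i+1] = P[i] + in[i];
  // each subsequence sum is the scan minus the scan shifted by L
  for (size_t i = 0; i + L < P.size(); i++)
    std::cout << P[i+L] - P[i] << ",";         // prints 5,6,5,3,
  std::cout << std::endl;
  return 0;
}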

What follows is a fully worked example, in thrust only. I've done it in thrust because writing a fast parallel prefix sum is not a trivial thing to do in CUDA, and I would therefore prefer to use (and would recommend others use) a library implementation. If you want to explore how to write your own parallel prefix sum, you can explore thrust, which is open source, or read this for an introduction.

Here's a fully worked example:

$ cat t1279.cu
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/copy.h>
#include <iostream>

const int arr_size =  1000;
const int seq_len  =   100;

typedef int mytype;

int main(){

  // test case
  mytype t_arr[] = {80,12,14,5,70,9,26,30,8,12,16,15,60,12,38,32,17,67,19,11,0};
  int t_arr_size = sizeof(t_arr)/sizeof(mytype);
  int t_seq_len = 5;
  thrust::device_vector<mytype> d_arr1(t_arr, t_arr+t_arr_size);
  thrust::device_vector<mytype> d_res1(t_arr_size);
  thrust::device_vector<mytype> d_out1(t_arr_size-t_seq_len);

  thrust::exclusive_scan(d_arr1.begin(), d_arr1.end(), d_res1.begin());
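  // subtract the scan shifted by the sequence length: out[i] = res[i+len] - res[i]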
  thrust::transform(d_res1.begin()+t_seq_len, d_res1.end(), d_res1.begin(), d_out1.begin(), thrust::minus<mytype>());
  thrust::copy_n(d_out1.begin(), t_arr_size-t_seq_len, std::ostream_iterator<mytype>(std::cout, ","));
  std::cout << std::endl;

  // case with larger array length and larger sequence length
  thrust::device_vector<mytype> d_arr(arr_size+1, 1);
  thrust::device_vector<mytype> d_res(arr_size+1);
  thrust::device_vector<mytype> d_out(arr_size+1-seq_len);

  thrust::inclusive_scan(d_arr.begin(), d_arr.end(), d_res.begin());
  thrust::transform(d_res.begin()+seq_len, d_res.end(), d_res.begin(), d_out.begin(), thrust::minus<mytype>());
  // validate
  for (int i = 0; i < arr_size+1-seq_len; i++) {
    mytype t = d_out[i];
    if (t != seq_len) {std::cout << "mismatch at: " << i << " was: " << t << " should be: " << seq_len << std::endl; return 1;}
    }
  return 0;
}

$ nvcc -arch=sm_35 -o t1279 t1279.cu
$ ./t1279
181,110,124,140,143,85,92,81,111,115,141,157,159,166,173,146,
$

Upvotes: 3

huseyin tugrul buyukisik

Reputation: 11926

Very similar question: Parallel implementation of the computation of the sum of contiguous subsequences in an array using Cuda

answered by Robert Crovella.

This was answered for sequence size = 4 before; now the only difference is size = 100. Either adapt that kernel to this similar question, or solve this on the CPU, since a CPU should be able to solve this (N=1000, L=100) problem within 1000-10000 cycles on a single core (the complexity is O(N)), while most of a GPU's cycles would sit unused and only an advanced GPU could do the load balancing needed to use those idle resources. The CPU could solve this even before the array has been sent to the GPU (a single-threaded sliding-window sketch is shown after the list below). But if this has to happen on the GPU (for example between two kernels), then I can only suggest hardware-related optimizations on top of Robert Crovella's answer, such as:

  • getting half of the array from the texture cache and/or constant memory, to further increase the memory bandwidth (on top of shared/global memory) feeding the compute resources (this should also increase the data size available per compute unit/SMX);
  • using multiple temporary accumulators instead of one, to increase the chance of the compiler doing some instruction optimizations (instruction-level parallelism);
  • using double precision alongside single precision to add some more performance, if the number of FP64 resources is comparable to the FP32 resources, but only on every Nth thread group, where N is something like 2 * total_FP32_units / total_FP64_units, and only if there are multiple warps per SMX so that both the FP32 and FP64 resources are kept busy. Maybe one of the accumulators could be FP64 instead of using FP64 for a whole group (but this would add more conversion latency to hide);
  • computing not just the i-th window per thread, but the (i+1)-th and (i+2)-th as well, which needs only 4 extra additions per thread instead of the 200 more that two additional threads would need after the i-th window is computed: for example, add the next element and subtract the oldest one. Why 3 jobs per thread? To keep all memory banks busy instead of causing memory-bank conflicts (if there are 2, 4, 8, or 10 memory banks in total); if there are 3, 6, or 9 memory banks, then 2 or 4 jobs per thread could work. But this increases (doubles, triples) the temporary memory needed, so there is an equilibrium point for performance, which could be benchmarked on a few architectures.
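
For reference, here is a minimal single-threaded sketch of that sliding-window idea on the CPU (add the newest element, subtract the oldest; the function name is my own):

#include <vector>

// O(N) sliding-window sums on the CPU, independent of the window length L.
// Assumes in.size() >= L.
std::vector<int> window_sums_cpu(const std::vector<int> &in, size_t L)
{
  std::vector<int> out(in.size() - L + 1);
  int s = 0;
  for (size_t i = 0; i < L; i++) s += in[i];  // first window
  out[0] = s;
  for (size_t i = 1; i < out.size(); i++) {
    s += in[i + L - 1] - in[i - 1];           // slide: add newest, drop oldest
    out[i] = s;
  }
  return out;
}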

I don't know CUDA, but it should be possible to use multiple memory qualifiers to use more area of the GPU chip to feed data faster, just like in OpenCL.

If shared memory and global memory have different/separate pipelines reaching the streaming processors, then you could even use global memory for some of the work (that is, not loading everything into shared memory, and instead leaving some of it to be read from global memory, so that both the shared and global pipelines are used).

A 32M-element array may not benefit from the first optimization, but you could still use FP64 accumulators alongside FP32 ones to get multiple temporary accumulators (for example each thread using 8 FP32 accumulators and 1 FP64 accumulator, using FMA to reduce its latency, which could be hidden behind those 8 accumulators), so that all array elements are treated equally. And if the constant memory or texture cache has room for 50,000 elements, it could serve 250 groups (each getting 200 elements from that memory) or 1250 groups (each getting 50 from that memory and 150 from the others, such as shared memory) and might still increase performance a little (a 100k-element array would need 1k groups, for example, with 25% of the threads using constant memory).

Upvotes: 1
