Reputation: 140
I have a large array of 1000 elements, and I want to compute the sums of large contiguous subsequences of size 100 in this array using CUDA. Here is an illustrative example with small sizes: the array size is 20 and the sequence size is 5.
tab = [80,12,14,5,70,9,26,30,8,12,16,15,60,12,38,32,17,67,19,11]
The sequences are as follows:
S1= 80+12+14+5+70
S2= 12+14+5+70+9
S3= 14+5+70+9+26
....
Do you have an efficient idea to parallelize this task using CUDA with an array of 1000 elements and a sequence of 100?
Upvotes: 0
Views: 305
Reputation: 152289
In this case, it may be appropriate to see whether we can leverage a prefix sum to help us. Fast parallel methods exist for prefix sum, so this makes it attractive to consider. A prefix sum is just an output sequence of numbers representing the sum of all previous numbers in the input sequence. So if I have an array like this:
1, 1, 3, 2, 0, 1
an exclusive prefix sum would be:
0, 1, 2, 5, 7, 7, 8
exclusive here means that each output position contains the sum of all previous input values, but excludes the input value at the current position. For an exclusive prefix sum, note that we can (if we wish) generate "useful" data for an output sequence that is 1 longer than the input sequence, which is why the result above has 7 elements for a 6-element input.
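For illustration, here is a minimal sketch that produces the above output with thrust::exclusive_scan (names like d_in and d_scan are illustrative; the input is padded with one trailing element, whose value is irrelevant, so the scan output covers that extra useful position, the same trick the full example below uses):

#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <iostream>

int main(){
  int a[] = {1, 1, 3, 2, 0, 1, 0};   // the sequence above, padded with one extra element
  thrust::device_vector<int> d_in(a, a + 7);
  thrust::device_vector<int> d_scan(7);
  thrust::exclusive_scan(d_in.begin(), d_in.end(), d_scan.begin());
  for (int i = 0; i < 7; i++) std::cout << d_scan[i] << ",";  // prints 0,1,2,5,7,7,8,
  std::cout << std::endl;
  return 0;
}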
The prefix sum can easily be adapted to solve the problem you are asking about. Suppose, for the sequence above, we wanted to compute the sums of the subsequences of length 3. We could take our prefix sum and subtract from it the same prefix-sum sequence shifted right by 3 (the sequence length), like so:
   0, 1, 2, 5, 7, 7, 8
-           0, 1, 2, 5, 7, 7, 8
=           5, 6, 5, 3
This sequence (5, 6, 5, 3) would be the desired answer, in this case.
What follows is a fully worked example, in thrust only. I've done it in thrust because writing a fast parallel prefix sum is not a trivial thing to do in CUDA, and I would therefore prefer to use (and would recommend others use) a library implementation. If you want to explore how to write your own parallel prefix sum, you can explore thrust, which is open source, or read this for an introduction.
Here's a fully worked example:
$ cat t1279.cu
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/copy.h>
#include <iostream>
const int arr_size = 1000;
const int seq_len = 100;
typedef int mytype;
int main(){
  // small test case, matching the example in the question;
  // the array is padded with one trailing 0 so the exclusive scan
  // output covers the full 20-element input
  mytype t_arr[] = {80,12,14,5,70,9,26,30,8,12,16,15,60,12,38,32,17,67,19,11,0};
  int t_arr_size = sizeof(t_arr)/sizeof(mytype);
  int t_seq_len = 5;
  thrust::device_vector<mytype> d_arr1(t_arr, t_arr+t_arr_size);
  thrust::device_vector<mytype> d_res1(t_arr_size);
  thrust::device_vector<mytype> d_out1(t_arr_size-t_seq_len);
  // prefix sum, then subtract the prefix sum shifted by the sequence length
  thrust::exclusive_scan(d_arr1.begin(), d_arr1.end(), d_res1.begin());
  thrust::transform(d_res1.begin()+t_seq_len, d_res1.end(), d_res1.begin(), d_out1.begin(), thrust::minus<mytype>());
  thrust::copy_n(d_out1.begin(), t_arr_size-t_seq_len, std::ostream_iterator<mytype>(std::cout, ","));
  std::cout << std::endl;
  // case with larger array length and larger sequence length
  thrust::device_vector<mytype> d_arr(arr_size+1, 1); // all-ones data; +1 element for the scan
  thrust::device_vector<mytype> d_res(arr_size+1);
  thrust::device_vector<mytype> d_out(arr_size+1-seq_len);
  // with uniform data, an inclusive scan works equally well here
  thrust::inclusive_scan(d_arr.begin(), d_arr.end(), d_res.begin());
  thrust::transform(d_res.begin()+seq_len, d_res.end(), d_res.begin(), d_out.begin(), thrust::minus<mytype>());
  // validate: every window sum of all-ones data should equal seq_len
  for (int i = 0; i < arr_size+1-seq_len; i++) {
    mytype t = d_out[i];
    if (t != seq_len) {std::cout << "mismatch at: " << i << " was: " << t << " should be: " << seq_len << std::endl; return 1;}
  }
  return 0;
}
$ nvcc -arch=sm_35 -o t1279 t1279.cu
$ ./t1279
181,110,124,140,143,85,92,81,111,115,141,157,159,166,173,146,
$
Upvotes: 3
Reputation: 11926
Very similar question: Parallel implementation of the computation of the sum of contiguous subsequences in an array using Cuda, answered by Robert Crovella.
That question was answered for a sequence size of 4; this one differs only in having size 100. Either adapt that kernel to this question, or solve it on the CPU: a single CPU core should handle this (N=1000, L=100) problem within 1000-10000 cycles, since the complexity is O(N) with a sliding window (a minimal host-side sketch follows the list below), while most of a GPU's cycles would sit idle and only an advanced GPU could do load-balancing to use those idle resources. The CPU could finish before the array is even sent to the GPU. But if this has to happen on the GPU (between two kernels), then I can suggest only hardware-related optimizations to Robert Crovella's answer, such as:
I don't know CUDA, but it should be possible to use multiple memory qualifiers, just as in OpenCL, to use more of the GPU chip's area to feed data faster.
If shared memory and global memory have different/separate pipelines reaching the streaming processors, then you could even use global memory for some of the job (that is, not loading everything into shared memory, but leaving some data to be read from global memory, so that both the shared-memory and global-memory lines are kept busy).
A 32M-element array may not get help from the first optimization, but you could still use multiple temporary accumulators per thread, for example each thread using 8 FP32 accumulators and 1 FP64 accumulator (using FMA to decrease its latency, which can be hidden behind those 8 accumulators), so that all array elements are treated equally. And if constant memory or the texture cache has space for 50000 elements, it could serve 250 groups (each taking 200 elements from that type of memory) or 1250 groups (each taking 50 from it and 150 from other memories such as shared), which may still increase performance a little (a 100k-element array would need 1k groups, for example, with 25% of threads using constant memory).
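As mentioned above, if the CPU route is chosen, the whole computation is a single O(N) pass with a sliding window. Here is a minimal host-side sketch of that idea (names like tab and out are illustrative; all-ones data is used just to make the expected output obvious):

#include <vector>
#include <iostream>

int main(){
  const int N = 1000, L = 100;               // array size and window size
  std::vector<int> tab(N, 1);                // example data: all ones
  std::vector<int> out(N - L + 1);           // one sum per window
  int s = 0;
  for (int i = 0; i < L; i++) s += tab[i];   // sum of the first window
  out[0] = s;
  for (int i = 1; i <= N - L; i++){          // slide: add the new element, drop the old
    s += tab[i + L - 1] - tab[i - 1];
    out[i] = s;
  }
  std::cout << out[0] << " " << out[N - L] << std::endl;  // prints "100 100" for all-ones data
  return 0;
}

Each element is read at most twice, so the whole job is a few thousand simple operations on a single core.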
Upvotes: 1