Reputation: 259
This is seemingly a simple problem but I just can’t figure out an elegant way to do this with CUDA Thrust.
I have a two dimensional matrix NxM and a vector of desired row indices of size L that is a subset of all rows(i.e. L < N) and is not regular (basically an irregular list like, 7,11,13,205,... etc.). The matrix is stored by rows in a thrust device vector. The array of indices is a device vector as well. Here are my two questions:
Thank you very much for your help.
Upvotes: 2
Views: 3424
Reputation:
if you are not fixed on thrust, check out Arrafire:
surprisingly unlike thrust, this library has a native support for subscript indexing, so that your problem can be solved in just few lines of code:
const int N = 7, M = 5;
float L_host[] = {3, 6, 4, 1};
int szL = sizeof(L_host) / sizeof(float);
// generate random NxM matrix with cuComplex data
array A = randu(N, M, c32);
// array used to index rows
array L(szL, 1, L_host);
print(A);
print(L);
array B = A(L,span); // copy selected rows of A
print(B);
and the results:
A =
0.7402 + 0.9210i 0.6814 + 0.2920i 0.5786 + 0.5538i 0.2133 + 0.4131i 0.7305 + 0.9400i
0.0390 + 0.9690i 0.3194 + 0.8109i 0.3557 + 0.7229i 0.0328 + 0.5360i 0.8432 + 0.6116i
0.9251 + 0.4464i 0.1541 + 0.4452i 0.2783 + 0.6192i 0.7214 + 0.3546i 0.2674 + 0.0208i
0.6673 + 0.1099i 0.2080 + 0.6110i 0.5876 + 0.3750i 0.2527 + 0.9847i 0.8331 + 0.7218i
0.4702 + 0.5132i 0.3073 + 0.4156i 0.2405 + 0.4148i 0.9200 + 0.1872i 0.6087 + 0.6301i
0.7762 + 0.2948i 0.2343 + 0.8793i 0.0937 + 0.6326i 0.1820 + 0.5984i 0.5298 + 0.8127i
0.7140 + 0.3585i 0.6462 + 0.9264i 0.2849 + 0.7793i 0.7082 + 0.0421i 0.0593 + 0.4797i
L = (row indices)
3.0000
6.0000
4.0000
1.0000
B =
0.6673 + 0.1099i 0.2080 + 0.6110i 0.5876 + 0.3750i 0.2527 + 0.9847i 0.8331 + 0.7218i
0.7140 + 0.3585i 0.6462 + 0.9264i 0.2849 + 0.7793i 0.7082 + 0.0421i 0.0593 + 0.4797i
0.4702 + 0.5132i 0.3073 + 0.4156i 0.2405 + 0.4148i 0.9200 + 0.1872i 0.6087 + 0.6301i
0.0390 + 0.9690i 0.3194 + 0.8109i 0.3557 + 0.7229i 0.0328 + 0.5360i 0.8432 + 0.6116i
it also works pretty fast. I tested this with an array of cuComplex of size 2000 x 2000 using the following code:
float *g_data = 0, *g_data2 = 0;
int g_N = 2000, g_M = 2000, // matrix of size g_N x g_M
g_L = 400; // copy g_L rows
void af_test()
{
array A(g_N, g_M, (cuComplex *)g_data, afDevicePointer);
array L(g_L, 1, g_data2, afDevicePointer);
array B = (A(L, span));
std::cout << "sz: " << B.elements() << "\n";
}
int main()
{
// input matrix N x M of cuComplex
array in = randu(g_N, g_M, c32);
g_data = (float *)in.device< cuComplex >();
// generate unique row indices
array in2 = setunique(floor(randu(g_L) * g_N));
print(in2);
g_data2 = in2.device<float>();
const int N_ITERS = 30;
try {
info();
af::sync();
timer::tic();
for(int i = 0; i < N_ITERS; i++) {
af_test();
}
af::sync();
printf("af: %.5f seconds\n", timer::toc() / N_ITERS);
} catch (af::exception& e) {
fprintf(stderr, "%s\n", e.what());
}
in.unlock();
in2.unlock();
}
Upvotes: 1
Reputation: 72349
What you are asking about seems like a pretty straight forward stream compaction problem, and there isn't any particular problem doing it with thrust, but there are a couple of twists. In order to select the rows to copy, you need to have an stencil or key that the stream compaction algorithm can use. That needs to be constructed by a search or select operation using your list of rows to copy.
One example procedure to do this would go something like this:
counting_iterator
and transform_iterator
which can be combined to do thisthrust::binary search
can be used for this. The search yields the stencil for the stream compaction operationthrust::copy_if
to perform the stream compaction on the input matrix with the stencil.It sounds like a lot of work and intermediate steps, but the counting and transformation iterators don't actually produce any intermediate device vectors. The only intermediate storage required is the stencil array, which can be a boolean (so m*n bytes).
A full example in code:
#include <thrust/copy.h>
#include <thrust/binary_search.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/device_vector.h>
#include <cstdio>
struct div_functor : public thrust::unary_function<int,int>
{
int m;
div_functor(int _m) : m(_m) {};
__host__ __device__
int operator()(int x) const
{
return x / m;
}
};
struct is_true
{
__host__ __device__
bool operator()(bool x) { return x; }
};
int main(void)
{
// dimensions of the problem
const int m=20, n=5, l=4;
// Counting iterator for generating sequential indices
// Sample matrix containing 0...(m*n)
thrust::counting_iterator<float> indices(0.f);
thrust::device_vector<float> in_matrix(m*n);
thrust::copy(indices, indices+(m*n), in_matrix.begin());
// device vector contain rows to select
thrust::device_vector<int> select(l);
select[0] = 1;
select[1] = 4;
select[2] = 9;
select[3] = 16;
// construct device iterator supplying row numbers via a functor
typedef thrust::counting_iterator<int> counter;
typedef thrust::transform_iterator<div_functor, counter> rowIterator;
rowIterator rows_begin = thrust::make_transform_iterator(thrust::make_counting_iterator(0), div_functor(n));
rowIterator rows_end = rows_begin + (m*n);
// constructor a stencil array which indicates which entries will be copied
thrust::device_vector<bool> docopy(m*n);
thrust::binary_search(select.begin(), select.end(), rows_begin, rows_end, docopy.begin());
// use stream compaction on the matrix with the stencil array
thrust::device_vector<float> out_matrix(l*n);
thrust::copy_if(in_matrix.begin(), in_matrix.end(), docopy.begin(), out_matrix.begin(), is_true());
for(int i=0; i<(l*n); i++) {
float val = out_matrix[i];
printf("%i %f\n", i, val);
}
}
(usual disclaimer: use at your own risk)
About the only comment I would make is that the predicate to the copy_if
call feels a bit redundant given we have already a binary stencil that could be used directly, but there doesn't seem to be a variant of the compaction algorithms which can operate on a binary stencil directly. Similarly, I could not think of a sensible way to use the list of rows directly in the stream compaction call. There might well be a more efficient way to do this with thrust, but this should at least get you started.
From your comment, it seems that space is tight and the additional memory overhead of the binary search and stencil creation is prohibitive for your application. In that case I would follow the advice I offered in a comment to Roger Dahl's answer, and use a custom copy kernel instead. Thrust device vectors can be cast to a pointer you can pass directly to a kernel (thrust::raw_pointer_cast
), so it need not interfere with your existing thrust code. I would suggest using a block of threads per row to copy, that allows coalescing of reads and writes and should perform a lot better than using thrust::copy
for each row. A very simple implementation might look something like this (reusing most of my thrust example):
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/device_vector.h>
#include <cstdio>
__global__
void rowcopykernel(const float *in, float *out, const int *list, const int m, const int n, const int l)
{
__shared__ const float * inrowp;
__shared__ float * outrowp;
if (threadIdx.x == 0) {
inrowp = (blockIdx.x < l) ? in + (n*list[blockIdx.x]) : 0;
outrowp = out + (n*blockIdx.x);
}
__syncthreads();
for(int i=threadIdx.x; (inrowp != 0) && (i<n); i+=blockDim.x) {
*(outrowp+i) = *(inrowp+i);
}
}
int main(void)
{
// dimensions of the problem
const int m=20, n=5, l=4;
// Sample matrix containing 0...(m*n)
thrust::counting_iterator<float> indices(0.f);
thrust::device_vector<float> in_matrix(m*n);
thrust::copy(indices, indices+(m*n), in_matrix.begin());
// device vector contain rows to select
thrust::device_vector<int> select(l);
select[0] = 1;
select[1] = 4;
select[2] = 9;
select[3] = 16;
// Output matrix
thrust::device_vector<float> out_matrix(l*n);
// raw pointer to thrust vectors
int * selp = thrust::raw_pointer_cast(&select[0]);
float * inp = thrust::raw_pointer_cast(&in_matrix[0]);
float * outp = thrust::raw_pointer_cast(&out_matrix[0]);
dim3 blockdim = dim3(128);
dim3 griddim = dim3(l);
rowcopykernel<<<griddim,blockdim>>>(inp, outp, selp, m, n, l);
for(int i=0; i<(l*n); i++) {
float val = out_matrix[i];
printf("%i %f\n", i, val);
}
}
(standard disclaimer: use at your own risk).
The execution parameter selection could be made fancier, but otherwise that should be about all that is required. If your rows are very small, you might want to investigate using a warp per row rather than a block (so one block copies several rows). If you have more than 65535 output rows, then you will need to either use a 2D grid, or modify the code to have each block do multiple rows. But, as with the thrust based solution about, this should get you started.
Upvotes: 3
Reputation: 15724
I don't think there is a way to do this with Thrust but, because the operation will be memory bound, it should be easy to write a kernel that performs this operation at maximum possible performance. Simply create the same number of threads as there are indices in the vector. Have each thread calculate the source and destination addresses for one row and then use memcpy()
to copy the row.
You may also want to carefully consider if it is possible to set up subsequent processing steps to access the rows in place, thereby avoiding the entire, expensive "compacting" operation, that only shuffles memory around. Even if addressing the rows becomes slightly more complicated (an extra memory lookup and multiply, maybe), overall performance may be much better.
Upvotes: 0