Reputation: 360
I'm trying to do a naive GPU parallel implementation of K-nearest neighbors using Matlab (i.e. I'm trying to avoid going to CUDA for now). Let's say I have a function
idx=MyKnn(A,x,k)
which computes the indices of the k rows of A which have smallest distance to x. We think of A as an N-by-d matrix, i.e. N vectors in d-dimensions. x is a single d-dimensional vector.
Now, say I have two (or more) vectors x and y, and I want to compute MyKnn(A,x,k) and MyKnn(A,y,k) (with both querying the same array A). Shouldn't I be able to run this in parallel on my GPU? Matlab has the arrayfun() method for applying the same function to every element of an array, but how would I apply the same function to say every row or column of an array?
If "use a for loop" is the best answer, that's fine - that's what I have going right now, it just seems like I should be able to eke a little more parallelism out.
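For reference, a sketch of the loop version I mean (here `X` is a hypothetical p-by-d matrix whose rows are the query vectors):

```matlab
% Loop over the query rows, calling MyKnn once per query.
p = size(X, 1);
allIdx = zeros(k, p);
for i = 1:p
    allIdx(:, i) = MyKnn(A, X(i,:), k);  % k nearest rows of A to X(i,:)
end
```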
Upvotes: 3
Views: 265
Reputation: 101
I've been thinking about this for a while so I'll take a stab at it, although I don't have a 100% answer. I have a lot of experience working in CUDA and Matlab and MEX, although I don't have experience working with the GPU portion of the Parallel Computing Toolbox, which you might be using.
To answer your question directly: use cellfun instead of arrayfun to get the behavior you're thinking of. Obviously, you'll need to re-index into cells.
The longer answer is that I don't think it would help. If you're spending lots of cycles on it you should try stuff out and see what works, but here's my reasoning: I'm going to assume that your MyKnn function looks like this:
t = bsxfun(@minus, A, x);    % subtract the query x from every row of A
distSq = sum(t .* t, 2);     % squared Euclidean distance for each row
[~, bestKs] = sort(distSq);  % ascending order of distance
idx = bestKs(1:k);
For small k you can sub out the sort with a specialized selection function (e.g. you can write a simple one that works in O(kN) time instead of O(N log N) time).
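One possible sketch of such a selection (assuming `distSq` is the vector of squared distances from above; newer MATLAB releases also have mink, which does this directly):

```matlab
% O(k*N) selection: extract the current minimum k times instead of
% fully sorting all N distances.
idx = zeros(k, 1);
d = distSq;
for j = 1:k
    [~, idx(j)] = min(d);  % index of the current smallest distance
    d(idx(j)) = inf;       % exclude it from subsequent passes
end
```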
For this kind of implementation, assuming that your N is reasonably large (say, 10,000 or larger), there is not too much advantage to be gained from parallelism because there is already enough vectorization inside the function. I would just stick with a for loop.
In general, I would say that although the GPU affords large parallelism, each of the hundreds of concurrent threads within the GPU is quite simple, so it's not true that you can just assign each one a separate MyKnn task. If I were writing this in CUDA directly, I would have the GPU work on a single MyKnn at a time (again, assuming that N is large).
Upvotes: 1