Reputation: 8597
If I have two matrices A and B of size [m x n] and [p x n], and I want to find the counts of the number of times each row of B appears in A, for example:
>> A = rand(5,3)
A =
0.1419 0.6557 0.7577
0.4218 0.0357 0.7431
0.9157 0.8491 0.3922
0.7922 0.9340 0.6555
0.9595 0.6787 0.1712
>> B = [A(2,:); A(1,:); A(2,:); A(3,:); A(3,:); A(4,:); A(5,:)]
B =
0.4218 0.0357 0.7431
0.1419 0.6557 0.7577
0.4218 0.0357 0.7431
0.9157 0.8491 0.3922
0.9157 0.8491 0.3922
0.7922 0.9340 0.6555
0.9595 0.6787 0.1712
with the answer in this case being
ans =
1 2 2 1 1
although unlike this example, in general m >> p
If A and B were vectors matlab's histc would do the job, but there doesn't seem to be an equivalent if the bins are vectors.
Currently I do it by:
for i=1:length(B)
indices(i) = find(abs(A/B(i,:)-1) < 1e-15);
% division requires a tolerance due to numerical issues
end
histc(indices, 1:size(A,1))
ans =
1 2 2 1 1
but since I have many such matrices B, and both A and B are large, this is horribly slow. Any ideas how to improve on this?
EDIT:
Looking at the methods so far, I have the following data:
A 7871139x3 188907336 double
B 902x3 21648 double
To make things quicker I'm just going to use the first 10 rows of B
B = B(1:10,:);
Note that for the full application I (currently) have >10^4 of such matrices (this will eventually be >10^6 ....)
My first method:
tic, C = get_vector_index(A,B); toc
Elapsed time is 36.630107 seconds.
bdecaf's method (can be reduced to ~25 seconds by removing the if
statement and using L1 distance instead of L2 distance)
>> tic, C1 = get_vector_index(A,B); toc
Elapsed time is 28.957243 seconds.
>> isequal(C, C1)
ans =
1
oli's pdist2 method
>> tic, C2 = get_vector_index(A,B); toc
Elapsed time is 7.244965 seconds.
>> isequal(C,C2)
ans =
1
oli's normalisation method
>> tic, C3 = get_vector_index(A,B); toc
Elapsed time is 3.756682 seconds.
>> isequal(C,C3)
ans =
1
Finally I came up with another method, where I search the first column, then I search the second column within the hits of the first column, recursing until the columns are exhausted. This is the fastest so far ....
N = size(A,2);
loc = zeros(size(B,1),1);
for i=1:size(B,1)
idx{1} = find(A(:,1)==B(i,1));
for k=2:N,
idx{k} = idx{k-1}(find(A(idx{k-1},k)==B(i,k)));
end
loc(i) = idx{end};
end
C = histc(loc, 1:size(A,1));
which results in:
>> tic, C4 = get_vector_index(A,B); toc
Elapsed time is 1.314370 seconds.
>> isequal(C, C4)
ans =
1
Also note that using intersect
is much slower:
>> tic, [~,IA] = intersect(A,B,'rows'); C5 = histc(IA,1:size(A,1)); toc
Elapsed time is 44.392593 seconds.
>> isequal(C,C5)
ans =
1
Upvotes: 3
Views: 2535
Reputation: 8597
I actually put this solution as an edit in the question, but for the sake of accepting an answer I'll put the solution in here too:
N = size(A,2);
loc = zeros(size(B,1),1);
for i=1:size(B,1)
idx{1} = find(A(:,1)==B(i,1));
for k=2:N,
idx{k} = idx{k-1}(find(A(idx{k-1},k)==B(i,k)));
end
loc(i) = idx{end};
end
C = histc(loc, 1:size(A,1));
which results in:
>> tic, C4 = get_vector_index(A,B); toc
Elapsed time is 1.314370 seconds.
>> isequal(C, C4)
ans =
1
Upvotes: 0
Reputation: 4732
I would solve it like this:
indices = zeros(size(A,1),1);
for i=1:size(B,1)
distances = sum( ( repmat(B(i,:),size(A,1),1)-A ).^2 ,2);
[md,im]=min(distances);
if md < 1e-9
indices(im) = indices(im)+1;
end
end
if you remove the if it will just sort into the nearest bin.
Upvotes: 1
Reputation: 16035
Actually, a simpler way of doing that is:
sum(10e-9>pdist2(A',B'),2)
it compute all pairwise distances and threshhold and count.
Upvotes: 1
Reputation: 16035
Maybe you could normalize them such that you check that their dot product is 1
A = rand(5,3);
B = [A(2,:); A(1,:); A(2,:); A(3,:); A(3,:); A(4,:); A(5,:)];
A2=bsxfun(@times,A,1./sqrt(sum(A.^2,2))); %%% normalize A
B2=bsxfun(@times,B,1./sqrt(sum(B.^2,2))) %%% normalize B
sum(A2*B2'>1-10e-9,2) %%% check that the dotproduct is close to 1
ans =
1
2
2
1
1
If you need something even faster but approximated, I recommand you to use the flann library, which is for fast approximated nearest neighbor:
http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN
Upvotes: 1