Reputation: 43
I have a code written in Matlab that uses 'intersect' to find the vectors (and their indices) that intersect in two large matrices. I found that 'intersect' is the slowest line (by a large difference) in my code. Unfortunately I couldn't find a faster alternative so far.
As an example running the code below takes approx 5 seconds on my pc:
profile on
for i = 1 : 500
a = rand(10000,5);
b = rand(10000,5);
[intersectVectors, ind_a, ind_b] = intersect(a,b,'rows');
end
profile viewer
I was wondering if there is a faster way. Note that the matrices (a) and (b) have 5 columns. The number of rows don't necessary have to be the same for the two matrices.
Any help would be great. Thanks
Upvotes: 4
Views: 1187
Reputation: 221614
You can use an approach that leverages fast matrix multiplication in MATLAB
to convert those 5
columns of input arrays into one column by considering each column as a significant "digit" of a single number. Thus, you would end up with an array with only column and then, you can use intersect
or ismember
without 'rows'
and that must speedup the codes in a big way!
Here are the promised implementations as function codes for easy usage -
intersectrows_fast_v1.m:
function [intersectVectors, ind_a, ind_b] = intersectrows_fast_v1(a,b)
%// Calculate equivalent one-column versions of input arrays
mult = [10^ceil(log10( 1+max( [a(:);b(:)] ))).^(size(a,2)-1:-1:0)]'; %//'
acol1 = a*mult;
bcol1 = b*mult;
%// Use intersect without 'rows' option for a good speedup
[~, ind_a, ind_b] = intersect(acol1,bcol1);
intersectVectors = a(ind_a,:);
return;
intersectrows_fast_v2.m:
function [intersectVectors, ind_a, ind_b] = intersectrows_fast_v2(a,b)
%// Calculate equivalent one-column versions of input arrays
mult = [10^ceil(log10( 1+max( [a(:);b(:)] ))).^(size(a,2)-1:-1:0)]'; %//'
acol1 = a*mult;
bcol1 = b*mult;
%// Use ismember to get indices of the common elements
[match_a,idx_b] = ismember(acol1,bcol1);
%// Now, with ismember, duplicate items are not taken care of automatically as
%// are done with intersect. So, we need to find the duplicate items and
%// remove those from the outputs of ismember
[~,a_sorted_ind] = sort(acol1);
a_rm_ind =a_sorted_ind([false;diff(sort(acol1))==0]); %//indices to be removed
match_a(a_rm_ind)=0;
intersectVectors = a(match_a,:);
ind_a = find(match_a);
ind_b = idx_b(match_a);
return;
With the datasizes listed in the question, the runtimes were -
-------------------------- With original approach
Elapsed time is 3.885792 seconds.
-------------------------- With Proposed approach - Version - I
Elapsed time is 0.581123 seconds.
-------------------------- With Proposed approach - Version - II
Elapsed time is 0.963409 seconds.
The results seem to suggest a big advantage in favour of the version - I
of the two proposed approaches with a whooping speedup of around 6.7x
over the original approach!!
Also, please note that if you don't need any one or two of the three outputs from the original intersect
with 'rows' based approach, then both the proposed approaches could be further shortened for better runtime performances!
Upvotes: 3