YAS
YAS

Reputation: 21

fast version of ismember() function in MATLAB

My question is about finding an alternative approach to what ismember() does in MATLAB in a much more faster way.

Here is my problem:

M [92786253*1]  (a: roughly 100M rows)
x [749*1]       (b: # of rows can vary from 100 to 10K)

I want to find rows in b that co-exists in a (the row indices of a). This operation needs to be repeated about 10M times for different version of b.

The Normal Approach:

 tic
 ind1 = ismember(M,x);
 toc

 Elapsed time is 0.515627 seconds.

The Fast Approach:

 tic
 n = 1;
 ind2 = find(any(all(bsxfun(@eq,reshape(x.',1,n,[]),M),2),3));
 toc

 Error using bsxfun
 Requested 92786253x1x749 (64.7GB) array exceeds maximum array size preference. 
 Creation of arrays greater than this limit may take a long time and cause MATLAB to become unresponsive.
 See array size limit or preference panel for more information.

 Error in demo_ismember_fast (line 23)
 ind2 = find(any(all(bsxfun(@eq,reshape(x.',1,n,[]),M),2),3))

The second approach is usually 15-20 times faster than the normal one, however in this case, I cannot use it for memory limitation. Are there any suggestion how to speedup this operation? Thanks for sharing expert opinions with me!

Upvotes: 2

Views: 3341

Answers (2)

Yair Altman
Yair Altman

Reputation: 5722

You might find the internal (built-in) ismembc function useful - it can be an order of magnitude faster than ismember: http://UndocumentedMatlab.com/blog/ismembc-undocumented-helper-function

Note that ismembc only works correctly for sorted non-sparse non-Nan numeric data.

Upvotes: 2

rahnema1
rahnema1

Reputation: 15867

If you can work with sorted a here are two alternative methods. Before starting 100M iterations some required input variables and the output variable ind are initialized and at each iteration ind is modified and at the end all element of it is set to false;

interp1:

s=sort(M);
edge = [-Inf s(2:end) Inf];
v = [1:numel(M) numel(M)];
ind = false(size(M));
%for ... 100M iterations
    tic
    bin = interp1(edge,v,x,'previous');
    ind(bin)= ind(bin)==x;
    toc
    %...
    ind(bin) = false;%at the end of each loop set all elements of ind to 0;
%end

histcounts:

s=sort(M);
edge= [-Inf s(2:end) Inf];
ind = false(size(M));
%for ... 100M iterations
    tic
    [~,~,bin]=histcounts(x,edge);
    ind(bin)= ind(bin)==x;
    toc
    %...
    ind(bin) = false;
%end

Upvotes: 1

Related Questions