Reputation: 21
My question is about finding an alternative approach to what ismember()
does in MATLAB in a much more faster way.
Here is my problem:
M [92786253*1] (a: roughly 100M rows)
x [749*1] (b: # of rows can vary from 100 to 10K)
I want to find rows in b
that co-exists in a
(the row indices of a). This operation needs to be repeated about 10M times for different version of b
.
The Normal Approach:
tic
ind1 = ismember(M,x);
toc
Elapsed time is 0.515627 seconds.
The Fast Approach:
tic
n = 1;
ind2 = find(any(all(bsxfun(@eq,reshape(x.',1,n,[]),M),2),3));
toc
Error using bsxfun
Requested 92786253x1x749 (64.7GB) array exceeds maximum array size preference.
Creation of arrays greater than this limit may take a long time and cause MATLAB to become unresponsive.
See array size limit or preference panel for more information.
Error in demo_ismember_fast (line 23)
ind2 = find(any(all(bsxfun(@eq,reshape(x.',1,n,[]),M),2),3))
The second approach is usually 15-20 times faster than the normal one, however in this case, I cannot use it for memory limitation. Are there any suggestion how to speedup this operation? Thanks for sharing expert opinions with me!
Upvotes: 2
Views: 3341
Reputation: 5722
You might find the internal (built-in) ismembc
function useful - it can be an order of magnitude faster than ismember
: http://UndocumentedMatlab.com/blog/ismembc-undocumented-helper-function
Note that ismembc
only works correctly for sorted non-sparse non-Nan numeric data.
Upvotes: 2
Reputation: 15867
If you can work with sorted a
here are two alternative methods. Before starting 100M iterations some required input variables and the output variable ind
are initialized and at each iteration ind
is modified and at the end all element of it is set to false
;
interp1:
s=sort(M);
edge = [-Inf s(2:end) Inf];
v = [1:numel(M) numel(M)];
ind = false(size(M));
%for ... 100M iterations
tic
bin = interp1(edge,v,x,'previous');
ind(bin)= ind(bin)==x;
toc
%...
ind(bin) = false;%at the end of each loop set all elements of ind to 0;
%end
histcounts:
s=sort(M);
edge= [-Inf s(2:end) Inf];
ind = false(size(M));
%for ... 100M iterations
tic
[~,~,bin]=histcounts(x,edge);
ind(bin)= ind(bin)==x;
toc
%...
ind(bin) = false;
%end
Upvotes: 1