Raketenolli

Reputation: 745

Octave / Matlab : Determine unique rows in very large matrix

I have a 13146 x 13146 matrix in Octave whose unique rows I want to determine. unique(M, "rows") fails because of Octave internals and/or memory limitations.

I looked at other posts on finding unique rows, but none of them addressed this issue with large matrices.

My current approach would be to "divide and conquer", e.g. by

A(:,:,i)=M(r_i:s_i,:)
B(:,:,i)=unique(A(:,:,i), "rows")

with i the index of the submatrix, and r_i and s_i the starting and ending row numbers of each submatrix. To combine all the data back into one large matrix (and again determine the unique rows):

C(:,:,i)=B(:,:,i)'
D=reshape(C,l,n*m)
E=D'
F=unique(E, "rows")

with n the number of submatrices, m the original number of rows in a submatrix and l the number of columns.
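
Concretely, the splitting step might look something like this (untested sketch; the block size is arbitrary, and I use a cell array rather than a 3-D array because each submatrix can yield a different number of unique rows):

blockSz = 1000;                          % rows per submatrix (arbitrary choice)
nRows = size(M, 1);
blocks = {};
for r = 1:blockSz:nRows
    s = min(r + blockSz - 1, nRows);     % r and s play the role of r_i and s_i
    blocks{end+1} = unique(M(r:s, :), "rows");
end
F = unique(vertcat(blocks{:}), "rows");  % merge and deduplicate once more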

Is there a better way to achieve the desired result?

Upvotes: 5

Views: 1541

Answers (3)

Mohsen Nosratinia

Reputation: 9864

You can slide a window across the columns, using a window size that doesn't cause memory problems. Here is a solution:

function A = winunique(A, winSz)
% Scan the columns in windows of winSz columns, carrying only a vector of
% row group labels between windows, so no more than winSz+1 columns are
% passed to unique() at a time.
nCol = size(A,2);
I = zeros(size(A,1), 0);       % group labels so far (starts empty)
for k = 1:winSz:nCol
    % relabel rows using the previous labels plus the next window of columns
    [~, ~, I] = unique([I A(:,k:min(k+winSz-1,end))], 'rows');
end
[~, I] = unique(I);            % pick one representative row index per final label
A = A(I, :);
end

Benchmark

In order to actually have some duplicate rows, it is better to generate the test matrix with duplicates built in; otherwise the whole operation is just a sort. Here is a comparison between the different approaches:

>> A=repmat(rand(13146,13146), 2, 1);
>> A=A(randperm(end), :);
>> A=A(1:end/2,:);

>> tic; B=rrunique(A); toc
Elapsed time is 13.318752 seconds.
>> tic; C=winunique(A, 16); toc
Elapsed time is 6.606122 seconds.
>> tic; D=unique(A, 'rows'); toc
Elapsed time is 29.815333 seconds.
>> isequal(B,C,D)
ans =
     1
>> size(D)
ans =
        9880       13146

Upvotes: 1

Divakar

Reputation: 221534

The basic idea here is the same as in Andrew's answer; only the later stage of the implementation differs. We sort the rows of the input array, which brings duplicate rows next to each other. Then we run through the rows looking for duplicates, which can be done efficiently with diff: any all-zero row in the diff output marks a row that duplicates the one before it. We turn that detection into a logical mask and use it to extract the valid (unique) rows from the input array. Here's the implementation, which seems to use about half the memory of unique(..., 'rows') -

sM = sortrows(M);                                % sort so duplicate rows are adjacent
out = sM([true ; any(diff(sM,[],1)~=0,2)],:);    % keep each row that differs from the previous one
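
As a quick illustration on a small made-up matrix (not part of the original benchmark):

M = [1 2; 3 4; 1 2; 5 6];
sM = sortrows(M);                                % [1 2; 1 2; 3 4; 5 6]
out = sM([true ; any(diff(sM,[],1)~=0,2)],:)     % [1 2; 3 4; 5 6]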

Upvotes: 0

Andrew Janke

Reputation: 23858

Radix sort it!

Sounds like you need a memory-efficient sorting algorithm. Unique rows can be found by first sorting the rows, and then checking adjacent rows for duplicates. You could adapt a radix sort for this, sorting on each column in sequence (instead of each digit in sequence). That would be a peak memory cost of sorting one column instead of the entire matrix. Then step through the rows in the sorted result and eliminate duplicates. That's an O(n) operation and only needs enough memory to hold two rows.

It can be "stable", too. If you track the rearranged row indexes in addition to the rearranged row values during the process of sorting, you can compute the input-output mapping indexes. (These are the I in the signature of Matlab's own [B,I] = sort(A).) That in turn will allow you to rearrange the post-duplicate-removal rows back in to their original order in the input, so you can preserve their order. (Like the setOrder='stable' option of Matlab's unique().) They can also be used to calculate the in-out mapping indexes for the overall uniqueness operation, so you can reproduce the full multi-output signature of unique(), which can be quite useful.

Example code

Here's a basic example implementation. I haven't tested it thoroughly, so don't use it in production without doing your own testing.

function A = rrunique(A)
%RRUNIQUE "Radix Row Unique" - find unique rows using radix sort
%
% # Returns the distinct rows in A. Uses the memory-efficient radix sort
% # algorithm, so peak memory usage stays low(ish) for large matrices.

% # This uses a modified radix sort where only the row remapping indexes are
% # rearranged at each step, instead of sorting the whole input, to avoid
% # having to rewrite the large input matrix.

ix = 1:size(A,1); % # Final in-out mapping indexes

% # Radix sort the rows
for iCol = size(A,2):-1:1
    c = A(ix,iCol);
    [~,ixStep] = sort(c);
    % # Don't do this! too slow
    % # A = A(ixStep,:);
    % # Just reorder the mapping indexes
    ix = ix(ixStep);
end    

% # Now, reorder the big array all at once
A = A(ix,:);

% # Remove duplicates
tfKeep = true(size(A,1),1);
priorRow = A(1,:);
for iRow = 2:size(A,1)
    thisRow = A(iRow,:);
    if isequal(thisRow, priorRow)
        tfKeep(iRow) = false;
    else
        priorRow = thisRow;
    end
end
A = A(tfKeep,:);

end

When I tested this on a matrix of your size on Matlab R2014b on OS X, it peaked at around 3 GB of memory used, vs about 1 GB to hold just the input matrix. Not bad.

>> m = rand([13146,13146]);
>> tic; rrunique(m); toc
Elapsed time is 17.435783 seconds.
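
If you also want the "stable" ordering described earlier, one possible (untested) tweak is to replace the final A = A(tfKeep,:); inside rrunique with something along these lines, using the tracked ix to put the surviving rows back into first-occurrence order:

% Untested sketch: restore the original ("stable") row order using ix
keptIdx = ix(tfKeep);         % original row indexes of the rows we keep
[~, order] = sort(keptIdx);   % first occurrences come first
A = A(tfKeep, :);
A = A(order, :);              % rows now in the order they first appear in the input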

Upvotes: 6
