Argyll

Reputation: 9875

How to remove repeating rows in a double array so that no row is identical to its preceding row

I have 4 n-by-1 column vectors, where elements sharing the same index have the same timestamp. I want to remove "rows" that are identical to their immediately preceding "rows"; imagine this being performed repeatedly until nothing changes.

For example, suppose the 4 vectors are

C1=[1;1;3;3;1;1];
C2=[2;2;4;4;2;2];
C3=[0;0;0;0;0;0];
C4=[5;5;6;6;5;5];

The desired output is

ans=[1;3;5];

because [C1(ans),C2(ans),C3(ans),C4(ans)] is an array with no row identical to its preceding row. In the above example, the resulting vectors look like:

C1=[1;3;1];
C2=[2;4;2];
C3=[0;0;0];
C4=[5;6;5];

"Rows" as in the rows when looking at the vectors concatenated column-wise with [C1,C2,C3,C4].

The question:


Some notes:

The reason I started with 4 separated column vectors is as follows:

  1. I have one other n-by-1 vector with unique elements where I will be removing the same "rows" based on the indices removed for the other 4 vectors;

  2. in my application, the data is retrieved from elsewhere and stored into a MATLAB data type element by element for further processing, and I see a performance advantage when storing into four N-by-1 doubles rather than into one N-by-4 double. This N is in the hundreds of thousands or millions.
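To illustrate note 1, here is a minimal sketch of how one index vector would filter all the vectors at once. The fifth vector `T` and its values are hypothetical, made up only for this illustration, and `idx` is simply the desired output stated above for the example data.

```matlab
% Example vectors from the question
C1 = [1;1;3;3;1;1];
C2 = [2;2;4;4;2;2];
C3 = [0;0;0;0;0;0];
C4 = [5;5;6;6;5;5];

% Hypothetical companion vector with unique elements (not real data)
T = (101:106).';

% Desired indices for this example, as stated in the question
idx = [1;3;5];

% The same indices are applied to every vector, including the companion one
C1f = C1(idx);
C2f = C2(idx);
C3f = C3(idx);
C4f = C4(idx);
Tf  = T(idx);
```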

n is typically only several thousand at a time, but I need each filtering pass to take as little time as possible, and well within 1 second.

(I want to learn the methods using native functions and compare performance.)


Note on performance

It's a bit hard to demonstrate performance differences on this one, since random data is not suitable and data that is too specific is not suitable either. (By hard, I mean it's hard to do quickly.)

But in case anyone is interested, with a table of ~164k rows and only ~1k "unique" rows (with quotation marks around "rows" as well), the results from timeit() are as follows.

The testing functions included concatenating if not directly processing columns.

The loop version is:

function varargout = accum(varargin)

    for i=1:numel(varargin)
        varargout{i}=varargin{i}(1);    % assuming single column
    end

    for i=2:numel(varargin{1})  % assuming equal length
        TF=false;
        for j=1:numel(varargin)
            TF=TF||varargin{j}(i)~=varargin{j}(i-1);
        end
        if TF
            for j=1:numel(varargin)
                varargout{j}=[varargout{j};varargin{j}(i)];
            end
        end
    end

end

If you are writing another answer and need sample data, let me know. Otherwise, I'll skip pasting it, seeing little use in doing so.

Upvotes: 2

Views: 151

Answers (3)

Wolfie

Reputation: 30046

You could use a similar approach to the answer given by Cris (find(diff(...))), but make it more generic using unique.

Setup:

C1=[1;1;3;3;1;1];
C2=[2;2;4;4;2;2];
C3=[0;0;0;0;0;0];
C4=[5;5;6;6;5;5];

C = [C1,C2,C3,C4];

Method one:

[~,~,iu] = unique( C, 'rows' );
idx = find( [1; diff(iu)] );
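As a worked illustration of why this works on the example data: the third output of unique labels each row with the index of its unique row, so two consecutive rows are identical exactly when their labels are equal. The intermediate name `d` is introduced here only for illustration.

```matlab
% Example matrix C = [C1,C2,C3,C4] from the setup above
C = [1 2 0 5; 1 2 0 5; 3 4 0 6; 3 4 0 6; 1 2 0 5; 1 2 0 5];

[~,~,iu] = unique( C, 'rows' );  % label per row: [1;1;2;2;1;1]
d = diff(iu);                    % label changes:  [0;1;0;-1;0]
idx = find( [1; d] );            % first row plus every change: [1;3;5]
```

Note that unique sorts the rows, so on large inputs this is likely to cost more than a direct comparison of neighbouring rows.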

Alternatively, you could loop through the rows (written in shorthand with arrayfun) to find those where any element differs from the previous row.

Method two:

idx = find( [1, arrayfun( @(ii) any(C(ii,:) ~= C(ii-1,:)), 2:size(C,1) )] )

Upvotes: 0

ThomasIsCoding

Reputation: 101343

Here is an option that uses a logical mask to select rows of the matrix:

C([true; abs(C(2:end,:)-C(1:end-1,:))*ones(size(C,2),1)>0],:)

which gives

ans =

   1   2   0   5
   3   4   0   6
   1   2   0   5
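For comparison, the same kind of mask can be sketched with diff and any instead of the matrix product. This is a variation on the expression above, not part of it; the variable names `mask` and `out` are chosen only for this example, and C is assumed to have at least one row.

```matlab
% Example matrix C = [C1,C2,C3,C4]
C = [1 2 0 5; 1 2 0 5; 3 4 0 6; 3 4 0 6; 1 2 0 5; 1 2 0 5];

% diff along dimension 1 compares each row with the previous one;
% any(...,2) flags rows where at least one column changed.
mask = [true; any(diff(C,1,1) ~= 0, 2)];

out = C(mask,:);   % rows that differ from their predecessor
idx = find(mask);  % the corresponding row indices
```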

If you don't mind writing a user-defined function, the following might be another option, where myfun recursively computes the "unique" rows:

function y = myfun(x)
  if size(x,1)==1
    y = x;
  else
    v = x(end,:);
    y = myfun(x(1:(end-1),:));
    if ~all(y(end,:)==v)
      y = [y;v];
    end
  end
end

such that

>> z = myfun(C)
z =

   1   2   0   5
   3   4   0   6
   1   2   0   5

where C = [C1,C2,C3,C4]

Upvotes: 2

Cris Luengo

Reputation: 60504

I think that the following gives the desired output (not tested):

find([1; diff(C1) | diff(C2) | diff(C3) | diff(C4)])

diff is non-zero where two consecutive elements are different. Using logical OR, we require that at least one of the vectors changes at a given position. The first element is always part of the output. find returns the indices of the non-zero elements.
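The same idea can be sketched for any number of column vectors by collecting them in a cell array and accumulating the OR in a loop. This is only a sketch under stated assumptions: the names `cols` and `changed` are hypothetical, and all vectors are assumed to be non-empty columns of equal length. The explicit `~= 0` keeps the operands of `|` logical, which avoids the conversion error MATLAB raises when NaN is used directly in a logical operation; consecutive NaN entries then count as different.

```matlab
% Example vectors from the question
C1 = [1;1;3;3;1;1];
C2 = [2;2;4;4;2;2];
C3 = [0;0;0;0;0;0];
C4 = [5;5;6;6;5;5];

cols = {C1, C2, C3, C4};                 % hypothetical container for the columns

% changed(i) is true when row i+1 differs from row i in any column
changed = false(numel(cols{1}) - 1, 1);
for k = 1:numel(cols)
    changed = changed | (diff(cols{k}) ~= 0);
end

idx = find([true; changed]);             % the first row is always kept
```

This never concatenates the columns into one matrix, which may matter when the data is already stored as separate vectors.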

Upvotes: 4
