Reputation: 9875
I have 4 n-by-1 column vectors, where elements sharing the same index have the same timestamp. I want to remove every "row" that is identical to the "row" immediately preceding it, as if this were applied repeatedly until nothing changes.
For example, suppose the 4 vectors are
C1=[1;1;3;3;1;1];
C2=[2;2;4;4;2;2];
C3=[0;0;0;0;0;0];
C4=[5;5;6;6;5;5];
The desired output is
ans=[1;3;5];
because [C1(ans),C2(ans),C3(ans),C4(ans)] is an array with no row identical to its preceding row. In the above example, the resulting vectors look like:
C1=[1;3;1];
C2=[2;4;2];
C3=[0;0;0];
C4=[5;6;5];
"Rows" as in the rows seen when the vectors are concatenated column-wise with [C1,C2,C3,C4].
Some notes:
The reason I started with 4 separate column vectors is as follows:
I have one other n-by-1 vector with unique elements, from which I will remove the same "rows", based on the indices removed from the other 4 vectors;
in my application, the data is retrieved from elsewhere and stored into a MATLAB data type element by element for further processing, and I see a performance advantage when storing into four N-by-1 doubles rather than one N-by-4 double. This N is in the hundreds of thousands or millions.
n is typically only several thousand at a time, but I need each filtering pass to finish within 1 second at most, and to be as short as possible.
(I want to learn the methods using native functions and compare performance.)
It's a bit hard to demonstrate performance differences on this one, since random data is not suitable and overly specific data is not suitable either. (By hard, I mean it's hard to do quickly.)
But in case anyone is interested, with a table of ~164k rows and only ~1k "unique" rows (the quotes around "rows" apply here as well), the results from timeit() are as follows.
Cris' diff/or method: 0.0028s
Wolfie's unique method: 0.0142s
Wolfie's arrayfun method: 0.3912s
Thomas' diff*ones method: 0.0057s
Thomas' recursion method: Unable to complete. This blew up MATLAB's RAM request to ~70GB within a minute of execution under timeit() and caused a UI freeze on my Win 10 machine, despite the machine having lots of unused CPU.
Loop (but with varargin for the number of columns): 3.6313s
Each test function included the concatenation step whenever the method does not process the columns directly.
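For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of a timeit() harness. The data is made up (the example vectors tiled 1000 times), not the ~164k-row table from the measurements, and the handle names fDiff and fOnes are just placeholders. The concatenation is placed inside the handle so that it is counted in the timing.

```matlab
% Illustrative data only: the question's example tiled 1000 times.
C1 = repmat([1;1;3;3;1;1], 1000, 1);
C2 = repmat([2;2;4;4;2;2], 1000, 1);
C3 = zeros(6000, 1);
C4 = repmat([5;5;6;6;5;5], 1000, 1);

% diff/or method, working on the separate columns directly.
fDiff = @() find([1; diff(C1) | diff(C2) | diff(C3) | diff(C4)]);

% diff*ones method; diff on a matrix takes row differences, which is the
% same as C(2:end,:) - C(1:end-1,:). The concatenation is timed as well.
fOnes = @() find([true; abs(diff([C1, C2, C3, C4])) * ones(4, 1) > 0]);

% timeit takes a zero-argument function handle and returns seconds.
tDiff = timeit(fDiff);
tOnes = timeit(fOnes);
fprintf('diff/or:   %.4g s\n', tDiff);
fprintf('diff*ones: %.4g s\n', tOnes);
```

The absolute numbers depend heavily on the data and the machine, so only the relative ordering is worth reading into.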
The loop version is:
function varargout = accum(varargin)
    for i=1:numel(varargin)
        varargout{i}=varargin{i}(1); % assuming single column
    end
    for i=2:numel(varargin{1}) % assuming equal length
        TF=false;
        for j=1:numel(varargin)
            TF=TF||varargin{j}(i)~=varargin{j}(i-1);
        end
        if TF
            for j=1:numel(varargin)
                varargout{j}=[varargout{j};varargin{j}(i)];
            end
        end
    end
end
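As a side note, the loop above grows every output array each time a row is kept, which generally forces repeated reallocation. A possible variation (not benchmarked here, so treat it as a sketch) is to record the kept rows in a preallocated logical mask and index once at the end. The name accumMask is made up for this example, and like the original it assumes non-empty column vectors of equal length.

```matlab
function varargout = accumMask(varargin)
    % Hypothetical variant of accum: same result, but no growing arrays.
    n = numel(varargin{1});      % assuming equal-length, non-empty columns
    keep = false(n, 1);
    keep(1) = true;              % the first row is always kept
    for i = 2:n
        for j = 1:numel(varargin)
            if varargin{j}(i) ~= varargin{j}(i-1)
                keep(i) = true;  % at least one column changed
                break
            end
        end
    end
    for j = 1:numel(varargin)
        varargout{j} = varargin{j}(keep);   % index each column once
    end
end
```

Called as [C1,C2,C3,C4] = accumMask(C1,C2,C3,C4), this returns the same filtered vectors as the loop version.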
If you are writing another answer and need sample data, let me know. Otherwise, I'll skip pasting it, seeing little use in doing so.
Upvotes: 2
Views: 151
Reputation: 30046
You could use a similar approach to the answer given by Cris (find(diff(...))), but make it more generic using unique.
Setup:
C1=[1;1;3;3;1;1];
C2=[2;2;4;4;2;2];
C3=[0;0;0;0;0;0];
C4=[5;5;6;6;5;5];
C = [C1,C2,C3,C4];
Method one:
[~,~,iu] = unique( C, 'rows' );
idx = find( [1; diff(iu)] );
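To see why this works, here are the intermediate values for the example data. The third output of unique labels each row with the index of its unique row, so a change in label between consecutive rows marks a row that differs from its predecessor.

```matlab
C = [1 2 0 5; 1 2 0 5; 3 4 0 6; 3 4 0 6; 1 2 0 5; 1 2 0 5];

% iu(k) is the index of row k within the sorted unique rows of C.
[~, ~, iu] = unique(C, 'rows');   % iu  = [1;1;2;2;1;1]

% A non-zero diff means the label changed; the leading 1 keeps row 1.
idx = find([1; diff(iu)]);        % idx = [1;3;5]
```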
Alternatively, you could loop through the rows (written compactly with arrayfun) to find rows where any element differs from the previous row.
Method two:
idx = find( [1, arrayfun( @(ii) any(C(ii,:) ~= C(ii-1,:)), 2:size(C,1) )] )
Upvotes: 0
Reputation: 101343
Here is an option that uses a logical index to subset the rows of the matrix C:
C([true; abs(C(2:end,:)-C(1:end-1,:))*ones(size(C,2),1)>0],:)
which gives
ans =
1 2 0 5
3 4 0 6
1 2 0 5
If you don't mind using a user-defined function, below might be another option, where myfun recursively computes the "unique" rows
function y = myfun(x)
    if size(x,1)==1
        y = x;
    else
        v = x(end,:);
        y = myfun(x(1:(end-1),:));
        if ~all(y(end,:)==v)
            y = [y;v];
        end
    end
end
such that
>> z = myfun(C)
z =
1 2 0 5
3 4 0 6
1 2 0 5
where C = [C1,C2,C3,C4]
Upvotes: 2
Reputation: 60504
I think that the following gives the desired output (not tested):
find([1; diff(C1) | diff(C2) | diff(C3) | diff(C4)])
diff is non-zero where two subsequent elements are different. Using logical OR, a position is kept if at least one of the vectors changes there. The first element is always part of the output. find returns the indices of the non-zero elements.
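For completeness, a short sketch of how the resulting indices could then be applied, using the example vectors from the question. T stands in for the extra n-by-1 vector mentioned there; its values are invented for illustration.

```matlab
C1 = [1;1;3;3;1;1];
C2 = [2;2;4;4;2;2];
C3 = [0;0;0;0;0;0];
C4 = [5;5;6;6;5;5];
T  = (101:106).';   % hypothetical fifth vector with unique elements

% Indices of rows that differ from the row before (row 1 always kept).
idx = find([1; diff(C1) | diff(C2) | diff(C3) | diff(C4)]);   % [1;3;5]

% The same indices filter every vector, including the extra one.
C1 = C1(idx);   % [1;3;1]
C2 = C2(idx);   % [2;4;2]
C3 = C3(idx);   % [0;0;0]
C4 = C4(idx);   % [5;6;5]
T  = T(idx);    % [101;103;105]
```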
Upvotes: 4