Reputation: 867
Given an integer array, find all the indices of all the duplicate elements in it. For example, consider A = [4, 12, 9, 8, 9, 12, 7, 1]. Since 12 and 9 have duplicates, all their indices will be returned, i.e. d = [2, 3, 5, 6]. The length of A is below 200 and the integer elements are between 1 and 5000.
Currently I am using the following function. However, it is too slow for my requirements. Is there any possible performance improvement?
function d = findDuplicates(A)
U = unique(A);
[co,ce] = hist(A,U);
an = ce(co>1);
d = [];
for i = 1:numel(an)
    d = [d, find(A==an(i))];
end
end
Upvotes: 3
Views: 288
Reputation: 11792
Edits:
1: Code corrected for edge case highlighted in comment, benchmark updated.
2: Added 'expansion' solution to benchmark (had to reduce max N element to 20000).
3: Added accumarray method to benchmark (winner for high N), and sparse method.
Here's another way to get the result, without using the functions unique or hist. It does rely on the function sort.
In an expanded form (if you want to see the result of the intermediate steps):
A = [4, 12, 9, 8, 9, 12, 7, 1] ;

[B,I] = sort(A) ;    % this will put duplicate elements side by side
df = diff(B) ;       % the duplicates will return '0' when subtracted
dx = find(df==0) ;   % find their indices

% Since each duplicate concerns 2 elements of the array, we add the next
% index for each "flagged" index, taking care not to duplicate the indices
% of successive duplicates.
if ~isempty(dx)
    dd = [diff(dx)~=1 , true] ;
    dx = [dx dx(dd)+1] ;
    d = I(dx)    % get the original positions of the duplicates in the original array
else
    d = [] ;
end
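For the example array, stepping through those intermediate variables gives the following (traced by hand; worth re-running in your own session to confirm):

```matlab
% Worked trace for A = [4, 12, 9, 8, 9, 12, 7, 1]:
%   B  = [1 4 7 8 9 9 12 12]        % sorted values
%   I  = [8 1 7 4 3 5 2 6]          % their original positions
%   df = [3 3 1 1 0 3 0]            % zeros at the two duplicate pairs
%   dx = [5 7]                      % first index of each pair
%   dd = [true true]                % diff(dx) = 2 ~= 1, so both are flagged
%   dx = [5 7 6 8]                  % each pair completed with its +1 partner
%   d  = I([5 7 6 8]) = [3 2 5 6]   % duplicate positions in the original A
```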
Which you could compact to:
[B,I] = sort(A) ;
dx = find(diff(B)==0) ;
if ~isempty(dx)
    d = I([dx dx([diff(dx)~=1,true])+1]) ;
else
    d = [] ;
end
gives:
d =
3 2 5 6
Personally, I would also sort the returned indices, but if that's not necessary and you're concerned about performance, you can just accept the unsorted result.
Here's another benchmark (testing the number of elements from 10 to 20000), run on MATLAB R2016a:
And the code for it:
function ExecTimes = benchmark_findDuplicates
    nOrder = (1:9).' * 10.^(1:3) ; nOrder = [nOrder(:) ; 10000 ; 20000] ;
    npt = numel(nOrder) ;
    ExecTimes = zeros(npt,6) ;
    for k = 1:npt
        % Sample data
        N = nOrder(k) ;
        A = randi(5000,[1,N]) ;
        % Benchmark
        f1 = @() findDuplicates_histMethod(A) ;
        f2 = @() findDuplicates_histcountMethod(A) ;
        f3 = @() findDuplicates_sortMethod(A) ;
        f4 = @() findDuplicates_expansionMethod(A) ;
        f5 = @() findDuplicates_accumarrayMethod(A) ;
        f6 = @() findDuplicates_sparseMethod(A) ;
        ExecTimes(k,1) = timeit( f1 ) ;
        ExecTimes(k,2) = timeit( f2 ) ;
        ExecTimes(k,3) = timeit( f3 ) ;
        ExecTimes(k,4) = timeit( f4 ) ;
        ExecTimes(k,5) = timeit( f5 ) ;
        ExecTimes(k,6) = timeit( f6 ) ;
        clear A
        disp(N)
    end

    function d = findDuplicates_histMethod(A)
        U = unique(A);
        [co,ce] = hist(A,U);
        an = ce(co>1);
        d = [];
        for i = 1:numel(an)
            d = [d, find(A==an(i))];
        end
    end

    function d = findDuplicates_histcountMethod(A)
        [~,idxu,idxc] = unique(A);
        [count, ~, idxcount] = histcounts(idxc,numel(idxu));
        idxkeep = count(idxcount)>1;
        idx_A = 1:length(A);
        d = idx_A(idxkeep);
    end

    function d = findDuplicates_sortMethod(A)
        [B,I] = sort(A) ;
        dx = find(diff(B)==0) ;
        if ~isempty(dx)
            d = I([dx dx([diff(dx)~=1,true])+1]) ;
        else
            d = [] ;
        end
    end

    function d = findDuplicates_expansionMethod(A)
        Ae = ones(numel(A),1) * A ;
        d = find(sum(Ae==Ae.')>1) ;
    end

    function d = findDuplicates_accumarrayMethod(A)
        d = find(ismember(A, find(accumarray(A(:), 1)>1))) ;
    end

    function d = findDuplicates_sparseMethod(A)
        d = find(ismember(A, find(sparse(A, 1, 1)>1)));
    end
end
Upvotes: 6
Reputation: 112659
I'm a little late to the party, but this question is begging for an accumarray-based solution :-)
d = find(ismember(A, find(accumarray(A(:), 1)>1)));
This exploits the fact that A contains small, positive integers, so they can be interpreted as indices. It works as follows:
accumarray(A(:), 1)                                   % count of occurrences of each value
find(accumarray(A(:), 1)>1)                           % values occurring more than once
d = find(ismember(A, find(accumarray(A(:), 1)>1)));   % their positions in A
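Traced by hand on the example A = [4, 12, 9, 8, 9, 12, 7, 1] (worth re-checking in a MATLAB session), the three steps produce:

```matlab
% accumarray(A(:), 1) -> [1 0 0 1 0 0 1 1 2 0 0 2].'   (counts for values 1..12)
% find(...>1)         -> [9; 12]                        (the duplicated values)
% ismember(A, [9 12]) -> [0 1 1 0 1 1 0 0]              (logical mask over A)
% d                   -> [2 3 5 6]
```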
Alternatively, sparse can be used instead of accumarray:
d = find(ismember(A, find(sparse(A, 1, 1)>1)));
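The sparse call works because sparse, like accumarray, sums values at repeated subscripts: sparse(A, 1, 1) builds a max(A)-by-1 column whose i-th entry is the number of times i occurs in A. A quick sanity check (a sketch, assuming A is a row vector of positive integers):

```matlab
c1 = accumarray(A(:), 1);     % dense count vector
c2 = full(sparse(A, 1, 1));   % same counts, built as a sparse column
isequal(c1, c2)               % true, e.g. for A = [4, 12, 9, 8, 9, 12, 7, 1]
```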
Upvotes: 4
Reputation: 35525
For completeness, here are the results of the other answers compared to yours, and to an accelerated version of yours (which I was working on before better people came to your rescue). For the sizes in your question:
for ii = 1:100
    a = randi(5000,1,200);
    t1(ii) = timeit(@()yours(a));
    a = randi(5000,1,200);
    t2(ii) = timeit(@()faster(a));
    a = randi(5000,1,200);
    t3(ii) = timeit(@()hoki(a));
    a = randi(5000,1,200);
    t4(ii) = timeit(@()am304(a));
end
disp(['Faster: x', num2str(mean(t1)/mean(t2))])
disp(['hoki: x', num2str(mean(t1)/mean(t3))])
disp(['am304: x', num2str(mean(t1)/mean(t4))])
function d = yours(A)
    U = unique(A);
    [co,ce] = hist(A,U);
    an = ce(co>1);
    d = [];
    for i = 1:numel(an)
        d = [d, find(A==an(i))];
    end
end

function d = faster(A)
    co = histcounts(A,max(A));
    an = co>1;   % note: a logical mask over all max(A) bins, so the loop below runs max(A) times
    d = [];
    for i = 1:numel(an)
        d = [d, find(A==an(i))];
    end
end

function res = am304(A)
    [~,idxu,idxc] = unique(A);
    [count, ~, idxcount] = histcounts(idxc,numel(idxu));
    idxkeep = count(idxcount)>1;
    idx_A = 1:length(A);
    res = idx_A(idxkeep);
end

function res = hoki(A)
    [B,I] = sort(A) ;
    dx = find(diff(B)==0) ;
    res = I([dx dx+1]) ;
end
The results are:
Faster: x0.0054505
hoki: x7.4142
am304: x1.0881
My faster version fails miserably in this case. Hoki's answer is the fastest: by a small margin for large arrays, but by a lot for small ones. Depending on the size and range of a, it (together with am304's answer) is 2~30 times faster.
Upvotes: 2
Reputation: 13876
Here's a solution (credit: this was adapted from https://uk.mathworks.com/matlabcentral/answers/175086-finding-non-unique-values-in-an-array)
A = [4, 12, 9, 8, 9, 12, 7, 1];
[~,idxu,idxc] = unique(A);
[count, ~, idxcount] = histcounts(idxc,numel(idxu));
idxkeep = count(idxcount)>1;
idx_A = 1:length(A);
idx_dup = idx_A(idxkeep);
It gives the following:
>> idx_dup = idx_A(idxkeep)
idx_dup =
2 3 5 6
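Step by step on this example (traced by hand, and assuming each of the numel(idxu) histogram bins lines up with exactly one unique value, which holds here since idxc takes the integer values 1..6):

```matlab
% unique(A)       -> [1 4 7 8 9 12]         (sorted unique values)
% idxc            -> [2 6 5 4 5 6 3 1].'    (which unique value each A element maps to)
% count           -> [1 1 1 1 2 2]          (occurrences of each unique value)
% count(idxcount) -> [1 2 2 1 2 2 1 1]      (occurrence count per element of A)
% idxkeep         -> [0 1 1 0 1 1 0 0]      (elements occurring more than once)
% idx_dup         -> [2 3 5 6]
```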
I'm not sure whether it's more or less efficient than your current solution; you'd need to test it with realistic data.
Upvotes: 3