impopularGuy
impopularGuy

Reputation: 867

find all the indices of all the duplicate elements in Array

Given an Integer array, find all the indices of all the duplicate elements in it.

For example, consider A = [4, 12, 9, 8, 9, 12, 7, 1]. Since 12 and 9 have duplicates, all their indices will be returned i.e. d = [2, 3, 5, 6]. The length of A is below 200 and the integer elements are between 1 and 5000.

Currently I am using the following function. However to meet my requirements, this function is slowing me down. Is there any possible performance improvement ?

function d = fincDuplicates(A)
    U = unique(A);
    [co,ce] = hist(A,U);
    an = ce(co>1);
    d=[];
    for i=1:numel(an)
        d=[d,find(A==an(i))];
    end
end

Upvotes: 3

Views: 288

Answers (4)

Hoki
Hoki

Reputation: 11792

Edits:

1: Code corrected for edge case highlighted in comment, benchmark updated.

2: Added 'expansion' solution to benchmark (had to reduce max N element to 20000).

3: Added accumarray method to benchmark (Winner for high N), and sparse method.


Here's another way to get the result, without using the function unique or hist. It does relly on the function sort.

In an expanded form (if you want to see the result of the intermediate steps):

A = [4, 12, 9, 8, 9, 12, 7, 1] ;

[B,I] = sort(A) ;    % this will put duplicate elements side by side
df = diff(B) ;       % the duplicates will return '0' when substracted
dx = find(df==0) ;   % find their indices

% Since each duplicate concerns 2 elemts of the array, we add the next
% index for each "flagged" index, taking care not to duplicate the indices
% of sucessive duplicates.
if ~isempty(dx)
    dd = [diff(dx)~=1 , true] ;
    dx = [dx dx(dd)+1] ;
    d = I(dx)   % get the original position of the duplicates in original array
else
    d=[] ;
end

Which you could compact to:

[B,I] = sort(A) ;
dx = find(diff(B)==0) ;
if ~isempty(dx)
    d = I([dx dx([diff(dx)~=1,true])+1]) ;
else
    d = [] ;
end

gives:

d =
     3     2     5     6

Personnaly I would also sort the returned indices but if it's not necessary and you're concerned about performance you can just accept the unsorted result.


Here's another benchmark (testing the number of element from 10 to 20000):

run on MATLAB R2016a

bench

And the code for it:

function ExecTimes = benchmark_findDuplicates

nOrder = (1:9).' * 10.^(1:3) ; nOrder = [nOrder(:) ; 10000 ; 20000 ] ;
npt = numel(nOrder) ;

ExecTimes = zeros(npt,6) ;

for k = 1:npt
    % Sample data
    N = nOrder(k) ;
    A = randi(5000,[1,N]) ;

    % Benchmark
    f1 = @() findDuplicates_histMethod(A) ;
    f2 = @() findDuplicates_histcountMethod(A) ;
    f3 = @() findDuplicates_sortMethod(A) ;
    f4 = @() findDuplicates_expansionMethod(A) ;
    f5 = @() findDuplicates_accumarrayMethod(A) ;
    f6 = @() findDuplicates_sparseMethod(A) ;
    ExecTimes(k,1) = timeit( f1 ) ;
    ExecTimes(k,2) = timeit( f2 ) ;
    ExecTimes(k,3) = timeit( f3 ) ;
    ExecTimes(k,4) = timeit( f4 ) ;
    ExecTimes(k,5) = timeit( f5 ) ;
    ExecTimes(k,6) = timeit( f6 ) ;

    clear A
    disp(N)
end

function d = findDuplicates_histMethod(A)
    U = unique(A);
    [co,ce] = hist(A,U);
    an = ce(co>1);
    d=[];
    for i=1:numel(an)
        d=[d,find(A==an(i))];
    end
end

function d = findDuplicates_histcountMethod(A)
    [~,idxu,idxc] = unique(A);
    [count, ~, idxcount] = histcounts(idxc,numel(idxu));
    idxkeep = count(idxcount)>1;
    idx_A = 1:length(A);
    d = idx_A(idxkeep);
end

function d = findDuplicates_sortMethod(A)
    [B,I] = sort(A) ;
    dx = find(diff(B)==0) ;
    if ~isempty(dx)
        d = I([dx dx([diff(dx)~=1,true])+1]) ;
    else
        d=[];
    end
end

function d = findDuplicates_expansionMethod(A)
    Ae = ones(numel(A),1) * A ;
    d = find(sum(Ae==Ae.')>1) ;
end

function d = findDuplicates_accumarrayMethod(A)
    d = find(ismember(A, find(accumarray(A(:), 1)>1))) ;
end

function d = findDuplicates_sparseMethod(A)
    d = find(ismember(A, find(sparse(A, 1, 1)>1)));
end

end

Upvotes: 6

Luis Mendo
Luis Mendo

Reputation: 112659

I'm a little late to the party, but this question is begging for an accumarray-based solution :-)

d = find(ismember(A, find(accumarray(A(:), 1)>1)));

This exploits the fact that A contains small, positive integers, so they can be interpreted as indices. It works as follows:

                          accumarray(A(:), 1)      % count of occurrences of each value
                     find(                   >1)   % values occurring more than once
d = find(ismember(A,                            ); % their positions in A

Alternatively, sparse can be used instead of accumarray:

d = find(ismember(A, find(sparse(A, 1, 1)>1)));

Upvotes: 4

Ander Biguri
Ander Biguri

Reputation: 35525

For completeness, here its the results of the other answers, compared to yours, and yours accelerated (which I was working on before better people came to your rescue). For the sizes in your question:

for ii=1:100
    a=randi(5000,1,200);
    t1(ii)=timeit(@()yours(a));

    a=randi(5000,1,200);
    t2(ii)=timeit(@()faster(a));

    a=randi(5000,1,200);
    t3(ii)=timeit(@()hoki(a));

    a=randi(5000,1,200);
    t4(ii)=timeit(@()am304(a));
end
disp(['Faster: x', num2str(mean(t1)/mean(t2))])
disp(['hoki: x', num2str(mean(t1)/mean(t3))])
disp(['am304: x', num2str(mean(t1)/mean(t4))])
disp(['Faster: x', num2str(t1/t2)])
disp(['hoki: x', num2str(t1/t3)])
disp(['am304: x', num2str(t1/t4)])
function d = yours(A)
    U = unique(A);
    [co,ce] = hist(A,U);
    an = ce(co>1);
    d=[];
    for i=1:numel(an)
        d=[d,find(A==an(i))];
    end
end

function d = faster(A)
    [co] = histcounts(A,max(A));
    an = co>1;
    d=[];
    for i=1:numel(an)
        d=[d,find(A==an(i))];
    end
end

function res=am304(A)
[~,idxu,idxc] = unique(A);
[count, ~, idxcount] = histcounts(idxc,numel(idxu));
idxkeep = count(idxcount)>1;
idx_A = 1:length(A);
res = idx_A(idxkeep);
end

function res=hoki(A)
[B,I] = sort(A) ;
dx = find(diff(B)==0) ;
res = I([dx dx+1]) ;
end

The results are:

Faster: x0.0054505
hoki: x7.4142
am304: x1.0881

My faster version fails miserably for this case.

I get that Hoki's answer is the fastest by a small margin in large arrays, but much faster in small ones, and that depending on the size and range of a, it is 2~30 times faster (together with am304's answer).

Upvotes: 2

am304
am304

Reputation: 13876

Here's a solution (credit: this was adapted from https://uk.mathworks.com/matlabcentral/answers/175086-finding-non-unique-values-in-an-array)

A = [4, 12, 9, 8, 9, 12, 7, 1];
[~,idxu,idxc] = unique(A);
[count, ~, idxcount] = histcounts(idxc,numel(idxu));
idxkeep = count(idxcount)>1;
idx_A = 1:length(A);
idx_dup = idx_A(idxkeep);

It gives the following:

>> idx_dup = idx_A(idxkeep)

idx_dup =

     2     3     5     6

Not sure if it's more or less efficient than your current solution. You probably need to test it with realistic data.

Upvotes: 3

Related Questions