Efficient way to assign each element in a large vector to the number of repeats

Question

I have a vector of about 15 million integers (int32) containing about 1 million unique values. I would like to find an efficient method of counting the repeated integers and relating them back to the original vector so that I have a new vector containing counts for each element.

That's probably clear as mud, so here's a small example of what I'm looking for:

A      = [1 1 1 3 2 5 2 1 2 6];
...
result = [4 4 4 1 3 1 3 4 3 1];

My (impractically slow ) implementation is as follows (Note that in my version of matlab, hist does not work with integers):

A = randi(1e6,[15e6,1],'int32');
result = zeros(size(A),'int32');
[uniqueA,~,iuA] = unique(A);
counts = accumarray(iuA,1);

So far, so good: uniqueA contains a list of the unique elements of A, and counts contains a list of the corresponding number of each. This is pretty quick.

Next comes the slow part. I have tried the following to retrieve the indices of each element:

cellIndex = arrayfun(@(x) A == x, uniqueA,'UniformOutput',false);

but this runs out of memory (with 16 GB ram) and grinds to a halt when it starts swapping. To avoid this I tried looping over the unique elements (1 million), which is also slow:

for n = 1:length(uA)
    result(A == uA(n)) = counts(n);
end

I don't know how long this takes, because I've been waiting half an hour and it still isn't finished.

Any ideas on how I can accomplish my task efficiently?

Divakar · Accepted Answer

One approach -

[unq,~,idx] = unique(A);
out = changem(idx,histc(A,unq),1:max(idx))

Sample run -

>> A
A =
     1     1     1     3     2     5     2     1     2     6
>> [unq,~,idx] = unique(A);
>> changem(idx,histc(A,unq),1:max(idx))
ans =
     4     4     4     1     3     1     3     4     3     1

Here's a simpler version -

[unq,~,idx] = unique(A);
counts = histc(A,unq);
out = counts(idx)

Sample run -

>> A
A =
     1     1     1     3     2     5     2     1     2     6
>> [unq,~,idx] = unique(A);
>> counts = histc(A,unq);
>> counts(idx)
ans =
     4     4     4     1     3     1     3     4     3     1

Efficient way to assign each element in a large vector to the number of repeats

Answers (1)

Related Questions