sodiumnitrate
sodiumnitrate

Reputation: 3131

Efficient way to generate histogram from very large dataset in MATLAB?

I have two 2D arrays of size up to 35,000*35,000 each: indices and dotPs. From this, I want to create two 1D arrays such that pop contains the number of times each number appears in indices and nn contains the sum of elements in dotPs that correspond to those numbers. I have come up with the following (really dumb) way:

dotPs = [81.4285    9.2648   46.3184    5.7974    4.5016    2.6779   16.0092   41.1426;
      9.2648   24.3525   11.4308   14.6598   17.9558   23.4246   19.4837   14.1173;
     46.3184   11.4308   92.9264    9.2036    2.9957    0.1164   26.5770   26.0243;
      5.7974   14.6598    9.2036   34.9984   16.2352   19.4568   31.8712    5.0732;
      4.5016   17.9558    2.9957   16.2352   19.6595   16.0678    3.5750   16.7702;
      2.6779   23.4246    0.1164   19.4568   16.0678   25.1084    6.6237   15.6188;
     16.0092   19.4837   26.5770   31.8712    3.5750    6.6237   61.6045   16.6102;
     41.1426   14.1173   26.0243    5.0732   16.7702   15.6188   16.6102   47.3289];

indices = [3     2     1     1     2     1     2     1;
           2     2     1     2     2     1     2     2;
           1     1     3     3     2     2     2     2;
           1     2     3     4     3     3     4     2;
           2     2     2     3     3     1     3     2;
           1     1     2     3     1     8     2     2;
           2     2     2     4     3     2     4     2;
           1     2     2     2     2     2     2     2];


nn = zeros(1,8);
pop = zeros(1,8);
uniqueInd = unique(indices);
for k=1:numel(uniqueInd)
    j = uniqueInd(k);
    [I,J]=find(indices==j);
    if j == 0 || numel(I) == 0
        continue
    end

    pop(j) = pop(j) + numel(I);
    nn(j) = nn(j) + sum(sum(dotPs(I,J)));
end

Because of the find function, this is very slow. How can I do this more smartly so that it runs in a few seconds rather than several minutes?

Edit: added small dummy matrices for testing the code.

Upvotes: 1

Views: 119

Answers (3)

Luis Mendo
Luis Mendo

Reputation: 112699

Both tasks can be done with the accumarray function:

pop = accumarray(indices(:), 1, [max(indices(:)) 1]).';
nn = accumarray(indices(:), dotPs(:), [max(indices(:)) 1]).';

This assumes that indices only contains positive integers.


EDIT:

From comments, only the lower part of the indices matrix without the diagonal should be used, and it is guaranteed to contain positive integers. In that case:

mask = tril(true(size(indices)), -1);
indices_masked = indices(mask);
dotPs_masked = dotPs(mask); 
pop = accumarray(indices_masked, 1, [max(indices_masked) 1]).';
nn = accumarray(indices_masked, dotPs_masked, [max(indices_masked) 1]).';

Upvotes: 3

Rotem
Rotem

Reputation: 32104

For computing pop, you can use hist, for computing nn, I couldn't find a smart solution (but I found a solution without using find):

pop = hist(indices(:), max(indices(:)));

nn = zeros(1,8);
uniqueInd = unique(indices);
for k=1:numel(uniqueInd)
    j = uniqueInd(k);
    nn(j) = sum(dotPs(indices == j));
end

There must be a better solution for computing nn.


I found a smarter solution applying sorting.

I am not sure it's faster, because sorting 35,000*35,000 elements might take a long time.

  1. Sort indices just for getting the index for sorting dotPs by indices.
  2. Sort dotPs according to index returned by previous sort.
  3. cumsumPop = Compute cumulative sum of pop (cumulative sum of the histogram of indices).
  4. cumsumPs = Compute cumulative sum of sorted dotPs.

  5. Now values of cumsumPop can be used as indices in cumsumPs.
    Because cumsumPs is cumulative sum, we need to use diff for getting the solution.

Here is the "smart" solution:

pop = hist(indices(:), max(indices(:)));

[sortedIndices, I] = sort(indices(:));
sortedDotPs = dotPs(I);

cumsumPop = cumsum(pop);
cumsumPs = cumsum(sortedDotPs);

nn = diff([0; cumsumPs(cumsumPop)]);
nn = nn';

Upvotes: 1

Mikhail Genkin
Mikhail Genkin

Reputation: 3460

First of all, note that the dimension of indices does not matter (e.g. if both indices and dotPs were 1D arrays or 3D arrays the result will be the same).

pop can be calculated by histcount function, but since you also need to calculate the sum of the corresponding elements of dotPs array the problem becomes harder.

Here is a possible solution with a for loop. The advantage of this solution is that I am not calling find function in a loop, so it should be faster:

%Example input
indices=randi(5,3,3);
dotPs=rand(3,3);

%Solution
[C,ia,ic]=unique(indices);
nn=zeros(size(C));
pop=zeros(size(C));
for i=1:numel(indices)
    nn(ic(i))=nn(ic(i))+1;
    pop(ic(i))=pop(ic(i))+dotPs(i);
end

This solution uses a vector ic to categorize each of the input values. After that, I go through each element and update nn(ic) and pop(ic).

Upvotes: 1

Related Questions