Reputation: 3131
I have two 2D arrays of size up to 35,000*35,000
each: indices
and dotPs
. From this, I want to create two 1D arrays such that pop
contains the number of times each number appears in indices
and nn
contains the sum of elements in dotPs
that correspond to those numbers. I have come up with the following (really dumb) way:
dotPs = [81.4285 9.2648 46.3184 5.7974 4.5016 2.6779 16.0092 41.1426;
9.2648 24.3525 11.4308 14.6598 17.9558 23.4246 19.4837 14.1173;
46.3184 11.4308 92.9264 9.2036 2.9957 0.1164 26.5770 26.0243;
5.7974 14.6598 9.2036 34.9984 16.2352 19.4568 31.8712 5.0732;
4.5016 17.9558 2.9957 16.2352 19.6595 16.0678 3.5750 16.7702;
2.6779 23.4246 0.1164 19.4568 16.0678 25.1084 6.6237 15.6188;
16.0092 19.4837 26.5770 31.8712 3.5750 6.6237 61.6045 16.6102;
41.1426 14.1173 26.0243 5.0732 16.7702 15.6188 16.6102 47.3289];
indices = [3 2 1 1 2 1 2 1;
2 2 1 2 2 1 2 2;
1 1 3 3 2 2 2 2;
1 2 3 4 3 3 4 2;
2 2 2 3 3 1 3 2;
1 1 2 3 1 8 2 2;
2 2 2 4 3 2 4 2;
1 2 2 2 2 2 2 2];
nn = zeros(1,8);
pop = zeros(1,8);
uniqueInd = unique(indices);
for k=1:numel(uniqueInd)
j = uniqueInd(k);
[I,J]=find(indices==j);
if j == 0 || numel(I) == 0
continue
end
pop(j) = pop(j) + numel(I);
nn(j) = nn(j) + sum(sum(dotPs(I,J)));
end
Because of the find
function, this is very slow. How can I do this more smartly so that it runs in a few seconds rather than several minutes?
Edit: added small dummy matrices for testing the code.
Upvotes: 1
Views: 119
Reputation: 112699
Both tasks can be done with the accumarray
function:
pop = accumarray(indices(:), 1, [max(indices(:)) 1]).';
nn = accumarray(indices(:), dotPs(:), [max(indices(:)) 1]).';
This assumes that indices
only contains positive integers.
EDIT:
From comments, only the lower part of the indices
matrix without the diagonal should be used, and it is guaranteed to contain positive integers. In that case:
mask = tril(true(size(indices)), -1);
indices_masked = indices(mask);
dotPs_masked = dotPs(mask);
pop = accumarray(indices_masked, 1, [max(indices_masked) 1]).';
nn = accumarray(indices_masked, dotPs_masked, [max(indices_masked) 1]).';
Upvotes: 3
Reputation: 32104
For computing pop
, you can use hist, for computing nn
, I couldn't find a smart solution (but I found a solution without using find
):
pop = hist(indices(:), max(indices(:)));
nn = zeros(1,8);
uniqueInd = unique(indices);
for k=1:numel(uniqueInd)
j = uniqueInd(k);
nn(j) = sum(dotPs(indices == j));
end
There must be a better solution for computing nn
.
I found a smarter solution applying sorting.
I am not sure it's faster, because sorting 35,000*35,000 elements might take a long time.
indices
just for getting the index for sorting dotPs
by indices
. dotPs
according to index returned by previous sort. pop
(cumulative sum of the histogram of indices
). cumsumPs = Compute cumulative sum of sorted dotPs
.
Now values of cumsumPop can be used as indices in cumsumPs.
Because cumsumPs is cumulative sum, we need to use diff
for getting the solution.
Here is the "smart" solution:
pop = hist(indices(:), max(indices(:)));
[sortedIndices, I] = sort(indices(:));
sortedDotPs = dotPs(I);
cumsumPop = cumsum(pop);
cumsumPs = cumsum(sortedDotPs);
nn = diff([0; cumsumPs(cumsumPop)]);
nn = nn';
Upvotes: 1
Reputation: 3460
First of all, note that the dimension of indices
does not matter (e.g. if both indices
and dotPs
were 1D arrays or 3D arrays the result will be the same).
pop
can be calculated by histcount
function, but since you also need to calculate the sum of the corresponding elements of dotPs
array the problem becomes harder.
Here is a possible solution with a for
loop. The advantage of this solution is that I am not calling find
function in a loop, so it should be faster:
%Example input
indices=randi(5,3,3);
dotPs=rand(3,3);
%Solution
[C,ia,ic]=unique(indices);
nn=zeros(size(C));
pop=zeros(size(C));
for i=1:numel(indices)
nn(ic(i))=nn(ic(i))+1;
pop(ic(i))=pop(ic(i))+dotPs(i);
end
This solution uses a vector ic
to categorize each of the input values. After that, I go through each element and update nn(ic)
and pop(ic)
.
Upvotes: 1