MathUser
MathUser

Reputation: 163

Matrix indexing for fast sums by group

I am running into an out of memory issue when running simulations in MATLAB.

To give an easy example, let's say I have a MATLAB table / matrix / collection of vectors that look like this:

 id | t | var
----+---+-----
  1 | 1 | 100
  1 | 2 | 150
  2 | 2 | 200
  2 | 3 |  90
  2 | 4 | 980

where id denotes individuals, t denotes time periods, and var is a numerical variable.

I need to sum the different values of var for a given individual id over time t. The easiest way I could find was to do run the following:

idx    = sparse(id == id');
sumvar = idx*sumvar;

which produces the desired result (a vector with the same length as var where each element is the sum for a given id).

 id | t | var | sumvar 
----+---+-----+--------
  1 | 1 | 100 |   250  
  1 | 2 | 150 |   250  
  2 | 2 | 200 |  1270  
  2 | 3 |  90 |  1270   
  2 | 4 | 980 |  1270  

The problem is that computing idx is extremely computer intensive, and with vectors of the order of 150,000 my computer runs out of memory.

One potential solution is to use the following code:

len = length(id);
idx = sparse(len,len);
for i = 1:len
   idx(id == id(i),:) = 1;
end

But that seems to be quite slow.

I feel this is a problem someone else may have already faced. Is there anything that could be both non-computing intensive but also fast enough?

Upvotes: 2

Views: 86

Answers (3)

Wolfie
Wolfie

Reputation: 30047

For your example table T:

 id | t | var
----+---+-----
  1 | 1 | 100
  1 | 2 | 150
  2 | 2 | 200
  2 | 3 |  90
  2 | 4 | 980

You can use varfun

a = varfun( @sum, T, 'GroupingVariables', 'id', 'InputVariables', 'var' )

Result:

a =
2×3 table
id    GroupCount    sum_var
__    __________    _______
1     2              250   
2     3             1270  

Upvotes: 0

Luis Mendo
Luis Mendo

Reputation: 112659

You can try accumarray, as follows. Let your data be

id = [1 1 2 2 2].';
var = [100 150 200 90 980].';
  • Assuming id always contains integer entries starting at 1:

    result = accumarray(id, var);
    

    gives

    result =
             250
            1270
    
  • If id is arbitrary, use:

    [~, ~, id_int] = unique(id);
    result = accumarray(id_int, var);
    
  • If you need the results repeated as with your code, add:

    result_repeated = result(id_int);
    

Upvotes: 3

rinkert
rinkert

Reputation: 6863

You could try the following, still using a loop, but in a bit more efficient way by iterating over only the unique id's.

id = [1 1 2 2 2].';
var = [100 150 200 90 980].';

unique_ids = unique(id);    % get the unique ids
sum_var = NaN(size(var));   % init the sum_var vector

for k = unique_ids.'        % loop over the ids
    idx = find(id == k);    % find indices per id
    sum_var(idx) = sum(var(idx));   % sum per id
end

Or, if you just need a vector with the sum per id:

unique_ids = unique(id);           % get the unique idx
sum_var = NaN(size(unique_ids));   % init the sum_var vector

for k = 1:numel(unique_ids)
    idx = find(id == unique_ids(k));    % find indices per id
    sum_var(k) = sum(var(idx));         % sum per id
end

Update: It can also be done without find, by using the indices per unique element that unique can return. Assuming your data is sorted by id you can do the following:


[unique_ids, start_idx, ~] = unique(id);  % get the unique idx, and the first occuring idx per id

sum_var = NaN(size(unique_ids));

start_idx = [start_idx; numel(var)+1];    % append total number of elements+1 for last summation in loop below

for k = 1:numel(unique_ids)
    ids = start_idx(k):start_idx(k+1)-1;  % indices in table for specific id
    sum_var(k) = sum(var(ids));
end

Upvotes: 1

Related Questions