Alex
Alex

Reputation: 15708

What is the fastest/most efficient/best way to compress 'grouped' table data in Matlab

I frequently have a table with a combination of columns that acts as a grouping key/common identifier, such that the key may be repeated across rows. A simple example:

sampleId = [1 1 1 3 3 3]';
entity = [1 2 3 1 4 5]';
dataTable = table(sampleId, entity)

where here, the observations of entity can be thought to be attached to sample 1, and 3.

I find it useful to compress this data so that the key is unique amongst the rows, for example, I would like a final table that looks like:

----------------------------
|  sampleId  |  entity     |
----------------------------
|      1     |  3x1 table  |
|      3     |  3x1 table  |
----------------------------

The only way I know of doing this is to use a for loop as follows:

tempCell = cell(length(unique(dataTable.sampleId)), 1);
counter = 1;
nonGroupVariables = dataTable.Properties.VariableNames(...
                    ~ismember(dataTable.Properties.VariableNames,'sampleId'));


for sampleId = unique(dataTable.sampleId)'

    tempCell(counter) = {dataTable(dataTable.sampleId == sampleId, nonGroupVariables)};

    counter = counter + 1;

end

newDataTable = table(unique(dataTable.sampleId), tempCell, 'VariableNames', ['sampleId', nonGroupVariables]);

Are there any better ways (more efficient/faster) of achieving this, maybe using accummarray, or grouping?

Upvotes: 2

Views: 106

Answers (2)

Luis Mendo
Luis Mendo

Reputation: 112699

You can use accumarray indeed. I'll distinguish two cases:

  • The table has n+1 columns. The first n are the grouping variables, and the last column is the data variable.
  • The table has an n+m of columns. The first n are the grouping variables, and the last m are data variables.

Of course the second case includes the first, but it's easier to consider the first and then proceed with the second.

n grouping variables, 1 data variable

sampleId  = [1 1 1 3 3 3]';
sampleId2 = [1 1 2 3 2 2]';
entity    = [1 2 3 1 4 5]'; %'
dataTable = table(sampleId, sampleId2, entity); %// example data
n = 2; %// number of grouping variables

[u, ~, v] = unique(dataTable{:,1:n}, 'rows');
c = accumarray(v, dataTable{:,n+1}, [], @(x) {x}); %// cell array of vectors,
    %// where each vector refers to one value of the grouping variable
ut = mat2cell(u, size(u,1), ones(1,n)); %// convert to cell array
compressedTable = [table(ut{:}, 'VariableNames', dataTable.Properties.VariableNames(1:n)) ...
    cell2table(c, 'VariableNames', dataTable.Properties.VariableNames(n+1))];
    %// create output table with correct variable names

This produces a table such that

  • The first n columns contain the unique combinations of the grouping variables, that is, of the first n columns in the original table.
  • The last column contains, in each row, a cell with a numeric vector. The numeric vector contains all values corresponding to the combination of the grouping variables given by that row.

Note that curly-bracket indexing into the table is used to make the code independendent of the table variable names. In the above example, the result is

>> compressedTable
compressedTable = 
    sampleId    sampleId2       entity   
    ________    _________    ____________
    1           1            [2x1 double]
    1           2            [         3]
    3           2            [2x1 double]
    3           3            [         1]

>> compressedTable.entity{1}
ans =
     2
     1
>> compressedTable.entity{2}
ans =
     3
>> compressedTable.entity{3}
ans =
     4
     5
>> compressedTable.entity{4}
ans =
     1

n grouping variables, m data variables

In this case you probably need to loop over the columns other than the first. In the following I use arrayfun for the looping.

sampleId  = [1 1 1 3 3 3]';
sampleId2 = [1 1 2 3 2 2]';
entity    = [1 2 3 1 4 5]'; %'
entity2   = entity*2;
dataTable = table(sampleId, sampleId2, entity, entity2); %// example data
n = 2; %// number of grouping variables

[u, ~, v] = unique(dataTable{:,1:n}, 'rows');
c = arrayfun(@(n) accumarray(v, dataTable{:,n}, [], @(x) {x}), n+1:size(dataTable,2), ...
    'uniformoutput', 0); %// cell array of cell arrays of vectors
ut = mat2cell(u, size(u,1), ones(1,n)); %// convert to cell array
compressedTable = [table(ut{:}, 'VariableNames', dataTable.Properties.VariableNames(1:n)) ...
    cell2table([c{:}], 'VariableNames', dataTable.Properties.VariableNames(n+1:end))];
    %// create output table with correct variable names

The result is

compressedTable = 
    sampleId    sampleId2       entity         entity2   
    ________    _________    ____________    ____________
    1           1            [2x1 double]    [2x1 double]
    1           2            [         3]    [         6]
    3           2            [2x1 double]    [2x1 double]
    3           3            [         1]    [         2]

>> compressedTable.entity{1}
ans =
     2
     1
>> compressedTable.entity2{1}
ans =
     4
     2
>> compressedTable.entity{2}
ans =
     3
>> compressedTable.entity2{2}
ans =
     6
>> compressedTable.entity{3}
ans =
     4
     5
>> compressedTable.entity2{3}
ans =
     8
    10
>> compressedTable.entity{4}
ans =
     1
>> compressedTable.entity2{4}
ans =
     2

Upvotes: 1

Alex
Alex

Reputation: 15708

I found another method using varfun:

compressedTable = varfun(@(x){x}, dataTable, 'GroupingVariables', 'sampleId');
compressedTable.GroupCount = [];
compressedTable.Properties.VariableNames = dataTable.Properties.VariableNames;

Upvotes: 0

Related Questions