Reputation: 35301
Suppose I have some table
(or dataset
) object A
that includes some factor X
(having n distinct values, or "levels") among its variables. Also suppose I have some custom aggregating function agg
that takes as input a sequence of rows, and returns a single summary number (aka "aggregate") as output.
An example of an extremely common operation in statistically data analysis would be to group the rows of A
according to the value of the X
factor, and to apply agg
to the arrays consisting of the rows in the group. The result of such an operation should be a new table
(or dataset
, depending on the class of A
), with n rows and two variables. The first one of these variables should be named X
, and should contain the n distinct values of the X
factor in A
, and the second should have some suitable (preferably user-specified) name, and should contain the result of applying agg
to the row groups for the corresponding X
levels.
As I said, this is a very standard operation to perform on data structures such as MATLAB's table
and dataset
objects, so I was expecting there'd be a built-in way to do it, but I'm not finding it.
For example, let A
be as defined below:
% "data" table
A = cell2table({
'even', 'red', 'spades', 38, 0.9837;
'even', 'red', 'hearts', 19, 0.5695;
'even', 'red', 'diamonds', 89, 0.2629;
'even', 'red', 'diamonds', 98, 0.3578;
'even', 'red', 'diamonds', 92, 0.2596;
'even', 'red', 'diamonds', 69, 0.5751;
'even', 'red', 'diamonds', 77, 0.6318;
'even', 'yellow', 'clubs', 22, 0.6917;
'even', 'green', 'spades', 35, 0.6674;
'even', 'green', 'hearts', 67, 0.7896;
'even', 'green', 'hearts', 49, 0.5025;
'even', 'green', 'hearts', 64, 0.5318;
'odd', 'red', 'spades', 22, 0.5587;
'odd', 'red', 'hearts', 51, 0.9122;
'odd', 'red', 'diamonds', 74, 0.3343;
'odd', 'red', 'diamonds', 69, 0.2911;
'odd', 'yellow', 'spades', 33, 0.2653;
'odd', 'yellow', 'spades', 38, 0.2549;
'odd', 'yellow', 'diamonds', 1, 0.2064;
'odd', 'yellow', 'diamonds', 25, 0.8257;
'odd', 'green', 'spades', 64, 0.4348;
'odd', 'green', 'hearts', 59, 0.8644;
'odd', 'green', 'hearts', 4, 0.6374;
'odd', 'green', 'hearts', 11, 0.3354
}, 'VariableNames', ...
{'Parity', 'TrafficLight', 'Suit', 'order', 'prevalence'});
Also, let X
be TrafficLight
, and agg
be
agg = @(t) size(t, 1);
(Of course, I'm using this agg
here just to keep the example as simple as possible. In practice agg
will be far less simple-minded.)
The typical group_aggregate
function I'm thinking of usually takes as input arguments (in some order) an aggregating function, the name of the column in the output for the computed aggregates, a table
(or dataset
), and the names of one or more variables to group by. Therefore, in this example, the call to such a function, and its output, would look something like this
>> group_aggregate(agg, 'nrows', A, {'TrafficLight'})
ans =
TrafficLight nrows
____________ _____
'green' 8
'red' 11
'yellow' 5
BTW, to get the result above, I conjured up this desperate little beast:
>> tmp = cellfun(@(s) {s agg(A(strcmp(A.TrafficLight, s), :))}, ...
unique(A.TrafficLight), 'un', 0);
>> cell2table(cat(1, tmp{:}), 'VariableNames', {'TrafficLight' 'nrows'})
I hope that a built-in solution is more robust to, for example, different classes of values for the X
variable, etc.
Upvotes: 1
Views: 2307
Reputation: 83
I know the question is 1 year old...
I use a different way that accumarray. I use tabulate http://www.mathworks.com/help/stats/tabulate.html?s_tid=gn_loc_drop
I also still have an older version of Matlab and I use therefore dataset. In this case (but I think it might also work for 'table'), your code would be:
TrafficLight_stats=tabulate(A.TrafficLight);
Agg_table=mat2dataset(TrafficLight_stats,'VarNames',{'TrafficLight','nrows','Perc'});
Agg_table =
TrafficLight nrows Perc
'red' [11] [45.8333]
'yellow' [ 5] [20.8333]
'green' [ 8] [33.3333]
Upvotes: 0
Reputation: 30579
I don't know of a table
method just for this purpose, but for aggregation, look to accumarray
:
>> [lights,ia,ic]=unique(A.TrafficLight);
>> nrows = accumarray(ic,1);
>> cell2table([lights num2cell(nrows)],'VariableNames', {'TrafficLight' 'nrows'})
ans =
TrafficLight nrows
____________ _____
'green' 8
'red' 11
'yellow' 5
Upvotes: 2