andschar
andschar

Reputation: 3973

data.table aggregate by group and keep corresponding values from other columns

I want to aggregate values of a data.table in R via multiple functions by a grouping variable AND keep the information that is in other columns (not included in the aggregation) in the corresponding rows (=the same row as the aggregation). An example:

Note: The code uses this which_quantile()-function (using sort(x) instead of order(x) in its code). It finds an actual value of a data set which is close to the defined quantile.

# sample data
dt <- structure(list(State = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
.Label = c("AK", "RI"), class = "factor"), Company = structure(1:8, .Label = c("A", 
"B", "C", "D", "E", "F", "G", "H"), class = "factor"), Employees = c(82L, 
104L, 37L, 24L, 19L, 118L, 88L, 42L), Number=c(1L,2L,3L,4L,5L,6L,7L,8L), Number2=c(9,10,11,12,13,14,15,16)),
.Names = c("State", "Company", "Employees", "Number", "Number2"), class = "data.frame", row.names = c(NA, 8L))

require(data.table)
setDT(dt)

# aggregation
agg <- dt[ , .(max = max(Employees),
               min = min(Employees),
               quantile70 = which.quantile(Employees, 0.7)), by=State]
agg_m <- dt[agg, on="State"]

aggregating a DT leads to the following output:

     State  max   min   quantile70
1:    AK    104   24    82
2:    RI    118   19    88

merging the aggregation with the original DT to that:

     State Company Employees Number Number2    max      min quantile70
1:    AK       A        82      1       9      104       24     82
2:    AK       B       104      2      10      104       24     82
3:    AK       C        37      3      11      104       24     82
4:    AK       D        24      4      12      104       24     82
5:    RI       E        19      5      13      118       19     88
6:    RI       F       118      6      14      118       19     88
7:    RI       G        88      7      15      118       19     88
8:    RI       H        42      8      16      118       19     88

Question: How can I aggregate the data.table and at the same time keep the corresponding values in the columns Company, Number and Number2? The max. of the Number column in the State AK is 104 and the corresponding value in column Number2 is 10. The min is 24 and the corresponding value 12 and so on. How to keep this information when aggregating a data.table?

desired output:

    State Company Employees Number Number2 aggregation
1:    AK       A        82      1       9      quantile70
2:    AK       B       104      2      10      max
3:    AK       D        24      4      12      min
4:    RI       E        19      5      13      min
5:    RI       F       118      6      14      max
6:    RI       H        88      8      16      quantile70

The question is similar to this one. The sample data was also taken from there and adapted.


The following aggregations don't solve my problem:

dt[ ,.SD[ which.max(Employees) ], by=State]
dt[dt[ ,.I[ which.max(Employees) ], by=State ]$V1]
# only which.max() OR which.min() are possible

dt[ , max_Empl := max(Employees), by=State ]
# only ONE aggregation function at a time is possible

Upvotes: 2

Views: 4121

Answers (1)

Frank
Frank

Reputation: 66819

Following @eddi's canonical answer on subsetting by group...

aggi <- dt[ , .(max = .I[which.max(Employees)],
               min = .I[which.min(Employees)],
               quantile70 = .I[which.quantile(Employees, 0.7)]), by=State]

From here, you can do

maggi <- melt(aggi, id="State")

dt[maggi$value][, v := maggi$variable][]

   State Company Employees Number Number2          v
1:    AK       B       104      2      10        max
2:    RI       F       118      6      14        max
3:    AK       D        24      4      12        min
4:    RI       E        19      5      13        min
5:    AK       A        82      1       9 quantile70
6:    RI       G        88      7      15 quantile70

Upvotes: 1

Related Questions