Sum column values based on multiple criteria in data table j

Question

I have the following data table data1 (more lines and more variations but this is a sample):

    item cat1 cat2 cat3 amounts
 1:    1   99 9999 9990     100
 2:    2   12 8199 9990     100
 3:    3   12 8199 9990     100
 4:    4   12 8199 9990     100
 5:    5   12 8199 9990     100
 6:    6   12 8199 9990     100
 7:    7   12 8199 9990     100
 8:    8   12 4129 9990     100
 9:    9   12 8199 9990     100
10:   10   12 8199 9990     100

library(data.table)
data1 <- setDT(structure(list(item = 1:10, cat1 = c("99", "12", "12", "12", 
"12", "12", "12", "12", "12", "12"), cat2 = c("9999", "8199", 
"8199", "8199", "8199", "8199", "8199", "4129", "8199", "8199"
), cat3 = c("9990", "9990", "9990", "9990", "9990", "9990", "9990", 
"9990", "9990", "9990"), amounts = c("100", "100", "100", "100", 
"100", "100", "100", "100", "100", "100")), .Names = c("item", 
"cat1", "cat2", "cat3", "amounts"), class = c("data.table", "data.frame"
), row.names = c(NA, -10L)))

Initially I wanted to get some information about the lines that fulfilled some criteria based on cat1, cat2, cat3. And so I did something like this:

data1[, .(  items = .N,
            group1 = sum(grepl("^[1-8]{2}$", cat1)),
            group2 = sum(grepl("^[1-8]9$", cat1)),
            group3 = sum(grepl("^9[1-8]$", cat1)),
            group4 = sum(cat1 == "99"))]

With the result being:

    items group1 group2 group3 group4
1:    10      9      0      0      1

There were lots of other criteria included in the analysis but this is also just a sample. My requirements changed and now for every group that is specified I need to sum also the amounts. So I have two questions:

1) Is there a way in data table to do this summation in a similar way to the one for calculating the counts (so the basic idea would be something like sum(amounts) where grepl("^[1-8]{2}$", cat1) )

2) Is there an efficient way of doing this that I am missing? I cannot think of any good way of getting my result aside from adding new columns to the original dataset for each criterion that I have and then doing a filtered sum.

My ideal result would be:

    items group1 group2 group3 group4 total_amounts  group1_amounts   group2_amounts group3_amounts group4_amounts
1:    10      9      0      0      1          1000            900              0              0            100

Frank · Accepted Answer

I'd say: make a table for the full mapping to groups (which are mutually exclusive):

m = data.table(g = paste0("group", 1:4))[,.(cat1 = as.character(
  if (.GRP==1L) combn(1:8, 2, paste0, collapse = "") else
  if (.GRP==2L) paste0(1:8, 9) else
  if (.GRP==3L) paste0(9, 1:8) else
  if (.GRP==4L) "99"
)), by=g]

Verify that the mapping is...

stopifnot(m[duplicated(cat1), .N == 0L])   # mutually exclusive
stopifnot(data1[!m, on=.(cat1), .N == 0L]) # exhaustive

Add group as a variable to the main table:

data1[m, on=.(cat1), g := i.g]

Join on groups to summarize by each group:

res = data1[.(g = unique(m$g)), on=.(g), .(.N, tot_amt = sum(as.numeric(amounts), na.rm=TRUE)), by=.EACHI]

#         g N tot_amt
# 1: group1 9     900
# 2: group2 0       0
# 3: group3 0       0
# 4: group4 1     100

I think this is the more useful format for the output, but if you really want the one-line output...

cbind(N = sum(res$N), dcast(res, . ~ g, value.var=c("N","tot_amt")))[, !"."]

#     N N_group1 N_group2 N_group3 N_group4 tot_amt_group1 tot_amt_group2 tot_amt_group3 tot_amt_group4
# 1: 10        9        0        0        1            900              0              0            100

How the "join" step works

The syntax is x[i, on=cols, j, by=.EACHI], where i is a list or data.table.

.() is an alias for list() inside some arguments of x[...].
by=.EACHI means group by each row of i (even for rows of i unmatched in x).
As usual, j is computed for each group determined in by=.

Comments

When making the group column, I think it's much better to make a table instead of using a series of regexes, like data1[grepl(yada), g := 1L][grepl(yada2) & is.na(g), g := 2L][grepl(yada3) & is.na(g), g := 3L] since with the latter, there's no way to catch cases where there is duplicated assignment (cat1 assigned to multiple g) or missed assignment (cat1 assigned to no g).

And when summarizing by group, I think it's best to do data1[.(all_groups), on=.(g), ..., by=.EACHI] rather than data1[, ..., by=g] since the latter will skip groups that happen to not appear in the table.

Sum column values based on multiple criteria in data table j

Answers (1)

Related Questions