Reputation: 3062
I have the following data table data1 (more lines and more variations but this is a sample):
item cat1 cat2 cat3 amounts
1: 1 99 9999 9990 100
2: 2 12 8199 9990 100
3: 3 12 8199 9990 100
4: 4 12 8199 9990 100
5: 5 12 8199 9990 100
6: 6 12 8199 9990 100
7: 7 12 8199 9990 100
8: 8 12 4129 9990 100
9: 9 12 8199 9990 100
10: 10 12 8199 9990 100
library(data.table)
data1 <- setDT(structure(list(item = 1:10, cat1 = c("99", "12", "12", "12",
"12", "12", "12", "12", "12", "12"), cat2 = c("9999", "8199",
"8199", "8199", "8199", "8199", "8199", "4129", "8199", "8199"
), cat3 = c("9990", "9990", "9990", "9990", "9990", "9990", "9990",
"9990", "9990", "9990"), amounts = c("100", "100", "100", "100",
"100", "100", "100", "100", "100", "100")), .Names = c("item",
"cat1", "cat2", "cat3", "amounts"), class = c("data.table", "data.frame"
), row.names = c(NA, -10L)))
Initially I wanted to get some information about the lines that fulfilled some criteria based on cat1, cat2, cat3. And so I did something like this:
data1[, .( items = .N,
group1 = sum(grepl("^[1-8]{2}$", cat1)),
group2 = sum(grepl("^[1-8]9$", cat1)),
group3 = sum(grepl("^9[1-8]$", cat1)),
group4 = sum(cat1 == "99"))]
With the result being:
items group1 group2 group3 group4
1: 10 9 0 0 1
There were lots of other criteria included in the analysis but this is also just a sample. My requirements changed and now for every group that is specified I need to sum also the amounts. So I have two questions:
1) Is there a way in data table to do this summation in a similar way to the one for calculating the counts (so the basic idea would be something like sum(amounts)
where grepl("^[1-8]{2}$", cat1)
)
2) Is there an efficient way of doing this that I am missing? I cannot think of any good way of getting my result aside from adding new columns to the original dataset for each criterion that I have and then doing a filtered sum.
My ideal result would be:
items group1 group2 group3 group4 total_amounts group1_amounts group2_amounts group3_amounts group4_amounts
1: 10 9 0 0 1 1000 900 0 0 100
Upvotes: 1
Views: 501
Reputation: 66819
I'd say: make a table for the full mapping to groups (which are mutually exclusive):
m = data.table(g = paste0("group", 1:4))[,.(cat1 = as.character(
if (.GRP==1L) combn(1:8, 2, paste0, collapse = "") else
if (.GRP==2L) paste0(1:8, 9) else
if (.GRP==3L) paste0(9, 1:8) else
if (.GRP==4L) "99"
)), by=g]
Verify that the mapping is...
stopifnot(m[duplicated(cat1), .N == 0L]) # mutually exclusive
stopifnot(data1[!m, on=.(cat1), .N == 0L]) # exhaustive
Add group as a variable to the main table:
data1[m, on=.(cat1), g := i.g]
Join on groups to summarize by each group:
res = data1[.(g = unique(m$g)), on=.(g), .(.N, tot_amt = sum(as.numeric(amounts), na.rm=TRUE)), by=.EACHI]
# g N tot_amt
# 1: group1 9 900
# 2: group2 0 0
# 3: group3 0 0
# 4: group4 1 100
I think this is the more useful format for the output, but if you really want the one-line output...
cbind(N = sum(res$N), dcast(res, . ~ g, value.var=c("N","tot_amt")))[, !"."]
# N N_group1 N_group2 N_group3 N_group4 tot_amt_group1 tot_amt_group2 tot_amt_group3 tot_amt_group4
# 1: 10 9 0 0 1 900 0 0 100
How the "join" step works
The syntax is x[i, on=cols, j, by=.EACHI]
, where i
is a list or data.table.
.()
is an alias for list()
inside some arguments of x[...]
.by=.EACHI
means group by each row of i
(even for rows of i
unmatched in x
).j
is computed for each group determined in by=
.Comments
When making the group column, I think it's much better to make a table instead of using a series of regexes, like data1[grepl(yada), g := 1L][grepl(yada2) & is.na(g), g := 2L][grepl(yada3) & is.na(g), g := 3L]
since with the latter, there's no way to catch cases where there is duplicated assignment (cat1
assigned to multiple g
) or missed assignment (cat1
assigned to no g
).
And when summarizing by group, I think it's best to do data1[.(all_groups), on=.(g), ..., by=.EACHI]
rather than data1[, ..., by=g]
since the latter will skip groups that happen to not appear in the table.
Upvotes: 2