Reputation: 633
I have a data.table with a factor column that has empty levels. I need the row count and the sums of other variables, grouped by multiple factors, including the one with empty levels. My question is similar to this one, but here I need to count across multiple factors.
For example, let data.table be:
library('data.table')
dtr <- data.table(v1=sample(1:15),
v2=factor(sample(letters[1:3], 15, replace = TRUE),levels=letters[1:5]),
v3=sample(c("yes", "no"), 15, replace = TRUE))
I want to do the following:
dtr[,list(freq=.N,mm=sum(v1,na.rm=T)),by=list(v2,v3)]
#Output is:
v2 v3 freq mm
1: b yes 4 22
2: b no 1 13
3: c no 3 10
4: a no 4 49
5: c yes 1 10
6: a yes 2 16
I want the output to include the empty levels of v2 as well ("d" and "e"), as in table(dtr$v2,dtr$v3), so the final output should look like this (the order doesn't matter):
v2 v3 freq mm
1: b yes 4 22
2: b no 1 13
3: c no 3 10
4: a no 4 49
5: c yes 1 10
6: a yes 2 16
7: d yes 0 0
8: d no 0 0
9: e yes 0 0
10: e no 0 0
I tried the method from the linked question, but I'm not sure how to use the J() join function when multiple columns are involved.
This works fine when grouping by one column only:
setkey(dtr,v2)
dtr[J(levels(v2)),list(freq=.N,mm=sum(v1,na.rm=T))]
However, dtr[J(levels(v2),v3),list(freq=.N,mm=sum(v1,na.rm=T))] doesn't include all combinations.
Upvotes: 19
Views: 3436
Reputation: 11
The existing answers helped me a lot, but I needed a version that can be run programmatically; i.e. the columns to be combined may vary from case to case.
Here's what I did, in case someone needs it. Note that it requires the variables in the data.table to be factors:
library('data.table')
# create the data table
dtr <- data.table(v1=sample(1:15),
v2=factor(sample(letters[1:3], 15, replace = TRUE),levels=letters[1:5]),
v3=factor(sample(c("yes", "no"), 15, replace = TRUE)))
# select the columns to be used as key
cols <- c('v2', 'v3')
setkeyv(dtr, cols)
# get all combinations of the levels in the selected columns
keyobject <- expand.grid(lapply(dtr[, ..cols], levels))
# get summary for each combination
dtr[keyobject, .(freq=.N,mm=sum(v1,na.rm=T)), by = .EACHI]
Edit: You may also use do.call(CJ, lapply(dtr[, ..cols], levels)) instead of expand.grid.
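For reference, here is a minimal sketch of the do.call(CJ, ...) variant mentioned in the edit. It assumes the same example data (with its own seed, so the numbers will differ from the other posts) and passes the join columns through on= rather than setting a key; the names lvls and out are only for illustration.

```r
library(data.table)

set.seed(1)
dtr <- data.table(
  v1 = sample(1:15),
  v2 = factor(sample(letters[1:3], 15, replace = TRUE), levels = letters[1:5]),
  v3 = factor(sample(c("yes", "no"), 15, replace = TRUE))
)

# columns to group by; this vector can be built programmatically
cols <- c("v2", "v3")

# one vector of levels per grouping column (a named list)
lvls <- lapply(dtr[, ..cols], levels)

# cross join of all levels: one row per combination, columns named like cols
keyobject <- do.call(CJ, lvls)

# summarise once per row of keyobject; combinations without data get freq 0
out <- dtr[keyobject, .(freq = .N, mm = sum(v1, na.rm = TRUE)),
           on = cols, by = .EACHI]
print(out)
```

Using on= leaves the row order of dtr untouched, which may matter if the table is used elsewhere; setkeyv(dtr, cols) as in the code above works as well.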
Upvotes: 1
Reputation: 132989
library(data.table)
set.seed(42)
dtr <- data.table(v1=sample(1:15),
v2=factor(sample(letters[1:3], 15, replace = TRUE),levels=letters[1:5]),
v3=sample(c("yes", "no"), 15, replace = TRUE))
res <- dtr[,list(freq=.N,mm=sum(v1,na.rm=T)),by=list(v2,v3)]
You can use CJ (a cross join). Doing this after aggregation avoids setting the key on the big table and should be faster.
setkeyv(res,c("v2","v3"))
res[CJ(levels(dtr[,v2]),unique(dtr[,v3])),]
# v2 v3 freq mm
# 1: a no 1 9
# 2: a yes 2 11
# 3: b no 2 11
# 4: b yes 3 23
# 5: c no 4 40
# 6: c yes 3 26
# 7: d no NA NA
# 8: d yes NA NA
# 9: e no NA NA
# 10: e yes NA NA
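The question asks for 0 rather than NA in the empty combinations. A minimal sketch of one way to get there, assuming the same data as above: join the aggregate onto the full grid and then overwrite the NAs by reference. The names grid and full are just illustrative, and on= is used here instead of a key.

```r
library(data.table)

set.seed(42)
dtr <- data.table(
  v1 = sample(1:15),
  v2 = factor(sample(letters[1:3], 15, replace = TRUE), levels = letters[1:5]),
  v3 = sample(c("yes", "no"), 15, replace = TRUE)
)

# aggregate first; only combinations present in the data appear here
res <- dtr[, .(freq = .N, mm = sum(v1, na.rm = TRUE)), by = .(v2, v3)]

# every level of v2 crossed with every observed value of v3
grid <- CJ(v2 = levels(dtr$v2), v3 = unique(dtr$v3))

# look up each grid row in the aggregate; missing combinations come back as NA
full <- res[grid, on = c("v2", "v3")]

# replace the NAs with 0 by reference (both columns are integer here)
full[is.na(freq), `:=`(freq = 0L, mm = 0L)]
print(full)
```

If mm were a double column (for example a mean), the replacement value would need to be 0 rather than 0L.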
Upvotes: 18
Reputation: 109
table() will also capture freq values that are zero. To get the "mm" column, you could do a basic join. For example,
library(data.table)
set.seed(42)
dtr <- data.table(v1=sample(1:15),
v2=factor(sample(letters[1:3], 15, replace = TRUE),levels=letters[1:5]),
v3=sample(c("yes", "no"), 15, replace = TRUE))
res <- as.data.table(dtr[,table(v2,v3)])
setnames(res,'N','freq')
setkey(res,v2,v3)
setkey(dtr,v2,v3)
res <- dtr[,.(mm=sum(v1,na.rm=TRUE)),by=c('v2','v3')][res]
I'm not sure how table() compares with the cross join in benchmarks.
Upvotes: 1