Reputation: 441
I have data in which some columns are factors and some are character. I want to count all combinations of these columns, including combinations that never occur, and wrap this in a function using data.table syntax.
# Load libraries
library(dplyr)
library(data.table)
# Create data
i_df = iris %>%
  filter(Species != 'virginica') %>%
  mutate(
    len = ifelse(Sepal.Length > 6, 'large', 'tiny'),
    width = ifelse(Sepal.Width > 3, 'thick', 'thin'),
    color = ifelse(Species == 'setosa', 'green', 'red')
  ) %>%
  mutate(
    len = factor(len, levels = c('large', 'med_len', 'tiny')),
    width = factor(width, levels = c('thick', 'med_width', 'thin'))
  )
This would be an example of my function:
myfun = function(d, g, mode) {
  # Convert to data.table
  setDT(d)
  # Counting
  res = d[, .N, by = g]
  # Complete combinations
  setkeyv(res, cols = g)
  res = switch(
    mode,
    manual = {
      res[CJ(levels(d$Species), levels(d$len), levels(d$width), unique(d$color)),]
    },
    auto = {
      m = res[, do.call(CJ, c(.SD, unique = TRUE)), .SDcols = g]
      res[m, on = g]
    }
  )
  # Add zero when NA
  res[is.na(res)] = 0
  # Return
  return(res)
}
How to run:
g_tmp = c('Species', 'len', 'width', 'color')
myfun(d = i_df, g = g_tmp, mode = 'manual')
myfun(d = i_df, g = g_tmp, mode = 'auto')
As you can see, I'm using setkeyv and not setkey, because I need to pass the character vector g. But when completing the combinations with CJ, I cannot get it working from a character vector input (mode = 'auto'). There, I want all factor levels for the factor columns and the unique values present (e.g. the colors) for the character columns. With mode = 'manual', 54 rows are returned; with mode = 'auto', factor levels that are not present in the data are not returned, and the result has only 16 rows.
I've found this answer and this one, but I cannot get them working when I have a mix of factor and character columns. As some columns are factors with levels that are not present in the data, unique is not suitable for them, only for the character columns.
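To make the difference concrete, here is a minimal sketch comparing levels() and unique() on a factor with an unused level, and the effect on the size of the CJ grid. The vectors f and color_vals are made-up illustrations, not part of the data above.

```r
library(data.table)

# Hypothetical factor with one unused level ('med_len')
f = factor(c('large', 'tiny', 'tiny'), levels = c('large', 'med_len', 'tiny'))
# Hypothetical character column
color_vals = c('green', 'red', 'green')

levels(f)                # all declared levels, including the unused one
as.character(unique(f))  # only the values present in the data

# Grid built from the declared levels: 3 x 2 combinations
grid_levels = CJ(len = levels(f), color = unique(color_vals))
# Grid built from observed values only: 2 x 2 combinations
grid_unique = CJ(len = as.character(unique(f)), color = unique(color_vals))

nrow(grid_levels)
nrow(grid_unique)
```

The grid built from unique() silently loses every combination involving 'med_len', which is the same effect seen with mode = 'auto'.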
Upvotes: 1
Views: 299
Reputation: 6529
Here is one possible way to solve your problem. Note that the argument with=FALSE in the data.table context allows you to select columns using the standard data.frame rules. In the example below, I assumed that the columns used to compute all combinations are passed to myfun as a character vector.
Keep in mind that no column in your dataset should be named gcases. Using .EACHI in by performs the operation in j once for each row of i.
myfun = function(d, g) {
  # get levels (for factors) and unique values for other types.
  fn <- function(x) if(is.factor(x)) levels(x) else unique(x)
  gcases <- lapply(setDT(d, key=g)[, g, with=FALSE], fn)
  # count based on all combinations
  d[do.call(CJ, gcases), .N, keyby=.EACHI]
}
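A possible way to run it, as a self-contained sketch. The function is repeated so the snippet runs on its own, and the data from the question is rebuilt with base R instead of dplyr, on the assumption that this is equivalent to the original pipeline.

```r
library(data.table)

# Function from above, repeated so this snippet is runnable by itself
myfun = function(d, g) {
  fn <- function(x) if (is.factor(x)) levels(x) else unique(x)
  gcases <- lapply(setDT(d, key = g)[, g, with = FALSE], fn)
  d[do.call(CJ, gcases), .N, keyby = .EACHI]
}

# Rebuild the question's data with base R (assumed equivalent to the dplyr code)
i_df = iris[iris$Species != 'virginica', ]
i_df$len = factor(ifelse(i_df$Sepal.Length > 6, 'large', 'tiny'),
                  levels = c('large', 'med_len', 'tiny'))
i_df$width = factor(ifelse(i_df$Sepal.Width > 3, 'thick', 'thin'),
                    levels = c('thick', 'med_width', 'thin'))
i_df$color = ifelse(i_df$Species == 'setosa', 'green', 'red')
n_obs = nrow(i_df)  # stored before myfun converts i_df by reference

g_tmp = c('Species', 'len', 'width', 'color')
res = myfun(d = i_df, g = g_tmp)

nrow(res)            # 3 * 3 * 3 * 2 = 54 combinations, as in mode = 'manual'
sum(res$N) == n_obs  # every observation is counted exactly once
res[N == 0]          # combinations that never occur in the data
```

Because .N is evaluated once per row of i, combinations without a match get a count of 0 directly, so no separate step replacing NA with 0 is needed.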
Upvotes: 2