

Complete with all combinations after counting on data.table

I have data in which some columns are factors and some are character. I want to count the rows for every combination of these columns, including combinations that do not occur, and wrap this in a function using data.table syntax.

# Load libraries

library(data.table)
library(dplyr)

# Create data

i_df = iris %>%
  filter(Species != 'virginica') %>%
  mutate(
    len   = ifelse(Sepal.Length > 6, 'large', 'tiny'),
    width = ifelse(Sepal.Width > 3, 'thick', 'thin'),
    color = ifelse(Species == 'setosa', 'green', 'red')
  ) %>% 
  mutate(
    len   = factor(len, levels = c('large', 'med_len', 'tiny')),
    width = factor(width, levels = c('thick', 'med_width', 'thin'))
  )

This would be an example of my function:

myfun = function(d, g, mode) {
  # Convert to data.table  
  d = as.data.table(d)
  # Counting
  res = d[, .N, by = g]
  # Complete combinations
  setkeyv(res, cols = g)
  res = switch(
    mode,
    manual = {
      res[CJ(levels(d$Species), levels(d$len), levels(d$width), unique(d$color)),]
    },
    auto = {
      m = res[, do.call(CJ, c(.SD, unique = TRUE)), .SDcols = g]
      res[m, on = g]
    }
  )
  # Add zero when NA
  res[is.na(N), N := 0L]
  # Return
  return(res[])
}

How to run:

g_tmp = c('Species', 'len', 'width', 'color')

myfun(d = i_df, g = g_tmp, mode = 'manual')
myfun(d = i_df, g = g_tmp, mode = 'auto')

As you can see, I'm using setkeyv rather than setkey because the grouping columns are passed as a character vector g. Completing the combinations with CJ works when I list the columns by hand (mode = 'manual'), but I cannot get it working from the character vector (mode = 'auto'). What I want there is all factor levels for the factor columns and the unique observed values for the character columns. With mode = 'manual', 54 rows are returned. With mode = 'auto', factor levels that do not occur in the data are dropped, and the result has only 16 rows.

I've found this answer and this one, but I cannot get either working when I have a mix of factor and character columns.

Because some columns are factors with levels that do not occur in the data, unique is not suitable for them; it only works for the character columns.
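To illustrate the difference, here is a minimal base R sketch. The toy vectors below are made up for the illustration and are not part of the data above.

```r
# Hypothetical toy factor: 'med_len' is a declared level that never occurs
len <- factor(c('large', 'tiny', 'tiny'),
              levels = c('large', 'med_len', 'tiny'))

# unique() only sees the values that are present
observed <- unique(as.character(len))

# levels() returns every declared level, including the unused 'med_len'
declared <- levels(len)

# A character column has no levels, so unique() is the only option there
color <- c('green', 'red', 'green')
colors_present <- unique(color)

print(observed)
print(declared)
print(colors_present)
```

This is why a single call to unique over all grouping columns loses the empty factor levels.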

Upvotes: 1

Views: 299

Answers (1)

B. Christian Kamgang


Here is one possible way to solve your problem. In data.table, the argument with=FALSE lets you select columns using the standard data.frame rules, so g can be a character vector of column names. In the example below, I assume that the columns used to compute all combinations are passed to myfun as a character vector. Keep in mind that no column in your dataset should be named gcases, because names used in i are looked up among the columns of d first. .EACHI in by performs the operation in j once for each row of i, so .N is the number of rows matching each combination (0 when there is no match).

myfun = function(d, g) {
  # get levels (for factors) and unique values for other types. 
  fn <- function(x) if(is.factor(x)) levels(x) else unique(x)
  gcases <- lapply(setDT(d, key=g)[, g, with=FALSE], fn)
  # count based on all combinations
  d[do.call(CJ, gcases), .N, keyby=.EACHI]
}
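
As a quick check of the idea, here is a minimal sketch on toy data. The table toy and its columns size and color are hypothetical names chosen for the illustration, and the sketch uses by=.EACHI on an explicitly keyed table rather than wrapping everything in a function.

```r
library(data.table)

# Hypothetical toy data: a factor with an unused level ('med') and a character column
toy <- data.table(
  size  = factor(c('large', 'tiny', 'tiny'), levels = c('large', 'med', 'tiny')),
  color = c('green', 'red', 'red')
)
g <- c('size', 'color')

# Levels for factors, observed unique values for every other type
fn <- function(x) if (is.factor(x)) levels(x) else unique(x)
gcases <- lapply(toy[, g, with = FALSE], fn)

# Key the table on the grouping columns, then join on every combination
# and count the matching rows for each row of the cross join
setkeyv(toy, g)
res <- toy[do.call(CJ, gcases), .N, by = .EACHI]

print(res)
```

The result should have one row per combination (3 levels times 2 colors), with N equal to 0 for combinations that never occur, such as every row where size is 'med'.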

Upvotes: 2
