Giora Simchoni

Reputation: 3689

R: Count categories in each column when not all categories appear

A simple data.frame with character columns:

df <- data.frame(x = c("a", "b", "c", "c"), y = c("a", "b", "b", "c"))

Suppose I want to count the categories in each column quickly, returning another data.frame. The following, using map() from purrr, is elegant and works:

library(purrr)  # provides map() and set_names(), and re-exports %>%

df %>%
  map(table) %>%
  Reduce(cbind, .) %>%
  data.frame() %>%
  set_names(c("x", "y"))

  x y
a 1 1
b 1 2
c 2 1

HOWEVER. What to do when not all categories appear in each column? Example:

df2 <- data.frame(x = c("a", "b", "b"), y = c("a", "a", "a"))

I would want the count for b in the y column to be 0. But I get:

df2 %>%
  map(table) %>%
  Reduce(cbind, .) %>%
  data.frame() %>%
  set_names(c("x", "y"))

  x y
a 1 3
b 2 3

Without even a warning! I'm guessing this is because of cbind's habit of recycling the elements of a shorter column to match the length of a longer one. I tried using qpcR:::cbind.na to at least get NA values for the missing categories, which I could later convert to 0, but I get this error:

Error in matrix(, maxRow - nrow(x), ncol(x)) : 
  invalid 'ncol' value (too large or NA)
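
The recycling guess can be checked directly in base R. This is a minimal sketch using the same values as df2; the names tx, ty and m are illustrative only and do not appear in the original post:

```r
# Per-column counts for df2, built by hand
tx <- table(c("a", "b", "b"))  # two categories: a and b
ty <- table(c("a", "a", "a"))  # one category: a

# cbind() treats both tables as plain vectors and recycles the shorter one
m <- cbind(x = tx, y = ty)
print(m)
```

Because the longer length (2) is a multiple of the shorter one (1), cbind() recycles silently, which would explain why the count of 3 is repeated in the b row without a warning.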

What's a great, fast solution, preferably from the tidyverse set of packages?

UPDATE:

For the first case where we know all categories are in all columns:

df %>% dmap(function(x) as.numeric(table(x)))

is probably much more elegant.

Upvotes: 1

Views: 692

Answers (1)

David Robinson

Reputation: 78600

You can use gather() and spread() from tidyr with dplyr's count() in the middle.

library(dplyr)
library(tidyr)

df2 <- data_frame(x = c("a", "b", "b"), y = c("a", "a", "a"))

df2 %>%
  gather(key, value) %>%
  count(key, value) %>%
  spread(key, n, fill = 0)

Result:

  value     x     y
* <chr> <dbl> <dbl>
1     a     1     3
2     b     2     0

The fill = 0 in spread() is what causes the b/y pair to be 0.
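
If a base R route is acceptable, the same zero-filling can be had by tabulating every column against one shared set of levels. This is a sketch under that assumption, not part of the approach above; lv and counts are hypothetical names introduced here:

```r
df2 <- data.frame(x = c("a", "b", "b"), y = c("a", "a", "a"),
                  stringsAsFactors = FALSE)

# All categories that occur anywhere in the data frame
lv <- sort(unique(unlist(df2, use.names = FALSE)))

# Tabulate each column against the shared levels; absent levels count as 0
counts <- sapply(df2, function(col) table(factor(col, levels = lv)))
counts <- as.data.frame(counts)
print(counts)
```

Here the categories end up as row names rather than in a value column, so this fits best when the result is only needed as a lookup table.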

Upvotes: 1
