balkon16

Reputation: 1448

R dplyr's group_by consider empty groups as well

Let's consider the following data frame:

set.seed(123)
data <- data.frame(col1 = factor(rep(c("A", "B", "C"), 4)),
                   col2 = factor(c(rep(c("A", "B", "C"), 3), c("A", "A", "A"))),
                   val1 = 1:12,
                   val2 = rnorm(12, 10, 15))

The contingency table is as follows:

cont_tab <- table(data$col1, data$col2, dnn = c("col1", "col2"))

cont_tab

    col2
col1 A B C
   A 4 0 0
   B 1 3 0
   C 1 0 3

As you can see, some pairs didn't occur: (A,B), (A,C), (B,C), (C,B). The end goal of my analysis is to list all of the pairs (in this case 9) and show a statistic for each of them. While using the dplyr::group_by() function I hit a limitation. Namely, dplyr::group_by() considers only the existing pairs (pairs that occurred at least once):

library(dplyr)

data %>%
  group_by(col1, col2) %>%
  summarize(stat = sum(val2) - sum(val1))

# A tibble: 5 x 3
# Groups:   col1 [?]
  col1  col2   stat
  <fct> <fct> <dbl>
1 A     A      58.1
2 B     A     -16.4
3 B     B      17.0
4 C     A     -12.9
5 C     C     -41.9

The output I have in mind has 9 rows (4 of which have a stat equal to 0). Is this doable in dplyr?

EDIT: Sorry for being too vague at the beginning. The real problem is more complex than counting the number of times a particular pair occurs. I added the new data in order to make the real problem more visible.
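
For reference, the full nine-pair grid I'm after can be enumerated from the factor levels; a minimal sketch in base R:

# all 9 (col1, col2) pairs, built from the factor levels of the data defined above
expand.grid(col1 = levels(data$col1), col2 = levels(data$col2))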

Upvotes: 4

Views: 3393

Answers (5)

tmfmnk

Reputation: 40141

Another tidyverse possibility, using tidyr::complete():

library(tidyverse)

data %>%
 group_by_all() %>%
 add_count() %>%
 complete(col1, col2, fill = list(n = 0)) %>%
 distinct()

  col1  col2      n
  <fct> <fct> <dbl>
1 A     A         4
2 A     B         0
3 A     C         0
4 B     A         1
5 B     B         3
6 B     C         0
7 C     A         1
8 C     B         0
9 C     C         3

Or using tidyr::expand():

data %>% 
 count(col1, col2) %>%
 right_join(data %>%
            expand(col1, col2), by = c("col1" = "col1",
                                       "col2" = "col2")) %>%
 replace_na(list(n = 0))

Or using tidyr::crossing():

data %>%
 count(col1, col2) %>%
 right_join(crossing(col1 = unique(data$col1), 
                     col2 = unique(data$col2)), by = c("col1" = "col1",
                                                       "col2" = "col2")) %>%
 replace_na(list(n = 0))

Upvotes: 1

swaps1

Reputation: 114

Here is a little workaround; I hope it works for you. Merge your table with a table of all combinations and replace the NAs with 0.

library(dplyr)
library(tidyr)

data %>%
  group_by(col1, col2) %>%
  summarize(stat = n()) %>%
  # build the grid of all 9 pairs from the two grouping columns only
  merge(unique(expand.grid(data[c("col1", "col2")])),
        by = c("col1", "col2"), all = TRUE) %>%
  replace_na(list(stat = 0))

Upvotes: 0

IceCreamToucan

Reputation: 28695

You can use tidyr::complete:

library(tidyverse)

data %>%
  group_by(col1, col2) %>%
  summarize(stat = n()) %>% 
  # additions below
  ungroup %>% 
  complete(col1, col2, fill = list(stat = 0))

# # A tibble: 9 x 3
#   col1  col2   stat
#   <chr> <chr> <dbl>
# 1 A     A         4
# 2 A     B         0
# 3 A     C         0
# 4 B     A         1
# 5 B     B         3
# 6 B     C         0
# 7 C     A         1
# 8 C     B         0
# 9 C     C         3

You can also use count for the first part. The code below gives the same output as the code above:

data %>%
  count(col1, col2) %>%
  complete(col1, col2, fill = list(n = 0)) 
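
The same pattern should carry over to the OP's actual statistic; a minimal sketch, assuming empty pairs should get a stat of 0:

data %>%
  group_by(col1, col2) %>%
  summarize(stat = sum(val2) - sum(val1)) %>%
  ungroup() %>%
  # empty pairs get stat = 0, matching the question's expected output
  complete(col1, col2, fill = list(stat = 0))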

Upvotes: 2

akrun

Reputation: 887541

It is much easier to add spread from tidyr to get the same result as with table:

library(dplyr)
library(tidyr)
count(data, col1, col2) %>% 
      spread(col2, n, fill = 0)
# A tibble: 3 x 4
# Groups:   col1 [3]
#  col1      A     B     C
#  <fct> <dbl> <dbl> <dbl>
#1 A         4     0     0
#2 B         1     3     0
#3 C         1     0     3

NOTE: The group_by/summarise step is changed to count here.

As @divibisan suggested, if the OP wants the long format, add gather at the end:

data %>%
   group_by(col1, col2) %>%
   summarize(stat = n()) %>%
   spread(col2, stat, fill = 0) %>%
   gather(col2, stat, A:C)
# A tibble: 9 x 3
# Groups:   col1 [3]
#  col1  col2   stat
#  <fct> <chr> <dbl>
#1 A     A         4
#2 B     A         1
#3 C     A         1
#4 A     B         0
#5 B     B         3
#6 C     B         0
#7 A     C         0
#8 B     C         0
#9 C     C         3

Update

With the updated data in the OP's post:

data %>%
   group_by(col1, col2) %>%
   summarize(stat = sum(val2) - sum(val1)) %>% 
   spread(col2, stat, fill = 0)  %>% 
   gather(col2, stat, -1)
# A tibble: 9 x 3
# Groups:   col1 [3]
#  col1  col2    stat
#  <fct> <chr>  <dbl>
#1 A     A       7.76
#2 B     A     -20.8 
#3 C     A       6.97
#4 A     B       0   
#5 B     B      28.8 
#6 C     B       0   
#7 A     C       0   
#8 B     C       0   
#9 C     C       9.56

Upvotes: 7

markus

Reputation: 26353

This is doable even without dplyr:

as.data.frame(table(data$col1, data$col2, dnn = c("col1", "col2")))
#  col1 col2 Freq
#1    A    A    4
#2    B    A    1
#3    C    A    1
#4    A    B    0
#5    B    B    3
#6    C    B    0
#7    A    C    0
#8    B    C    0
#9    C    C    3
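
And a base-R sketch of extending this to the OP's statistic, assuming empty pairs should show a stat of 0:

# per-pair sums for the pairs that actually occur
stats <- aggregate(cbind(val1, val2) ~ col1 + col2, data = data, FUN = sum)
# merge onto the full 9-row grid given by the contingency table
full <- merge(as.data.frame(table(data$col1, data$col2, dnn = c("col1", "col2"))),
              stats, all.x = TRUE)
# pairs with no observations get stat = 0
full$stat <- ifelse(is.na(full$val1), 0, full$val2 - full$val1)
full[, c("col1", "col2", "stat")]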

Upvotes: 4
