Reputation: 1448
Let's consider the following data frame:
set.seed(123)
data <- data.frame(col1 = factor(rep(c("A", "B", "C"), 4)),
                   col2 = factor(c(rep(c("A", "B", "C"), 3), c("A", "A", "A"))),
                   val1 = 1:12,
                   val2 = rnorm(12, 10, 15))
The contingency table is as follows:
cont_tab <- table(data$col1, data$col2, dnn = c("col1", "col2"))
cont_tab
    col2
col1 A B C
   A 4 0 0
   B 1 3 0
   C 1 0 3
As you can see, some pairs didn't occur: (A,B), (A,C), (B,C), (C,B). The end goal of my analysis is to list all of the pairs (9 in this case) and show a statistic for each of them. While using the dplyr::group_by() function I hit a limitation: group_by() considers only the pairs that actually occur in the data, i.e. pairs that occurred at least once:
data %>%
  group_by(col1, col2) %>%
  summarize(stat = sum(val2) - sum(val1))
# A tibble: 5 x 3
# Groups:   col1 [?]
  col1  col2   stat
  <fct> <fct> <dbl>
1 A     A      58.1
2 B     A     -16.4
3 B     B      17.0
4 C     A     -12.9
5 C     C     -41.9
The output I have in mind has 9 rows (4 of which have stat equal to 0). Is this doable in dplyr?
EDIT: Sorry for being too vague at the beginning. The real problem is more complex than counting the number of times a particular pair occurs. I added the new data in order to make the real problem more visible.
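For reference, newer dplyr releases (0.8.0 and later, if I recall correctly) added a .drop argument to group_by() that keeps empty factor-level combinations. A minimal sketch under that version assumption, not something the original question relied on:

# Sketch assuming dplyr >= 0.8; .drop = FALSE keeps factor combinations with zero rows
data %>%
  group_by(col1, col2, .drop = FALSE) %>%
  summarize(stat = sum(val2) - sum(val1))  # empty groups give sums of 0, hence stat = 0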
Upvotes: 4
Views: 3393
Reputation: 40141
Also a tidyverse possibility, using tidyr::complete():
data %>%
  group_by_all() %>%
  add_count() %>%
  complete(col1, col2, fill = list(n = 0)) %>%
  distinct()
  col1  col2      n
  <fct> <fct> <dbl>
1 A     A         4
2 A     B         0
3 A     C         0
4 B     A         1
5 B     B         3
6 B     C         0
7 C     A         1
8 C     B         0
9 C     C         3
Or using tidyr::expand():
data %>%
  count(col1, col2) %>%
  right_join(data %>%
               expand(col1, col2),
             by = c("col1" = "col1", "col2" = "col2")) %>%
  replace_na(list(n = 0))
Or using tidyr::crossing():
data %>%
  count(col1, col2) %>%
  right_join(crossing(col1 = unique(data$col1),
                      col2 = unique(data$col2)),
             by = c("col1" = "col1", "col2" = "col2")) %>%
  replace_na(list(n = 0))
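The three variants above count occurrences; the same join-and-fill idea should also work for the statistic from the edited question. A minimal sketch, assuming the val1/val2 columns shown there:

# Sketch: crossing() + right_join(), but with the OP's statistic instead of a count
data %>%
  group_by(col1, col2) %>%
  summarize(stat = sum(val2) - sum(val1)) %>%
  right_join(crossing(col1 = unique(data$col1),
                      col2 = unique(data$col2)),
             by = c("col1", "col2")) %>%
  replace_na(list(stat = 0))  # pairs that never occur get stat = 0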
Upvotes: 1
Reputation: 114
Here is a little workaround; I hope it works for you. Merge your table with a table of all combinations and replace the NAs with 0.
data %>%
  group_by(col1, col2) %>%
  summarize(stat = n()) %>%
  # cross only the two grouping columns, otherwise expand.grid() also crosses val1/val2
  merge(unique(expand.grid(data[c("col1", "col2")])), by = c("col1", "col2"), all = TRUE) %>%
  replace_na(list(stat = 0))
Upvotes: 0
Reputation: 28695
You can use tidyr::complete
library(tidyverse)
data %>%
  group_by(col1, col2) %>%
  summarize(stat = n()) %>%
  # additions below
  ungroup() %>%
  complete(col1, col2, fill = list(stat = 0))
# # A tibble: 9 x 3
#   col1  col2   stat
#   <chr> <chr> <dbl>
# 1 A     A         4
# 2 A     B         0
# 3 A     C         0
# 4 B     A         1
# 5 B     B         3
# 6 B     C         0
# 7 C     A         1
# 8 C     B         0
# 9 C     C         3
You can also use count for the first part; the code below gives the same result as the code above (with the count column named n rather than stat):
data %>%
  count(col1, col2) %>%
  complete(col1, col2, fill = list(n = 0))
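Since the edited question ultimately wants sum(val2) - sum(val1) rather than a count, the same complete() call should accept that statistic as well. A minimal sketch under that assumption:

# Sketch: complete() with the OP's actual statistic; missing pairs are filled with stat = 0
data %>%
  group_by(col1, col2) %>%
  summarize(stat = sum(val2) - sum(val1)) %>%
  ungroup() %>%
  complete(col1, col2, fill = list(stat = 0))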
Upvotes: 2
Reputation: 887541
It is much easier to add spread from tidyr to get the same result as with table:
library(dplyr)
library(tidyr)
count(data, col1, col2) %>%
  spread(col2, n, fill = 0)
# A tibble: 3 x 4
# Groups:   col1 [3]
#  col1      A     B     C
#  <fct> <dbl> <dbl> <dbl>
#1 A         4     0     0
#2 B         1     3     0
#3 C         1     0     3
NOTE: The group_by/summarise step is changed to count here.
As @divibisan suggested, if the OP wanted long format, then add gather at the end:
data %>%
  group_by(col1, col2) %>%
  summarize(stat = n()) %>%
  spread(col2, stat, fill = 0) %>%
  gather(col2, stat, A:C)
# A tibble: 9 x 3
# Groups:   col1 [3]
#  col1  col2   stat
#  <fct> <chr> <dbl>
#1 A     A         4
#2 B     A         1
#3 C     A         1
#4 A     B         0
#5 B     B         3
#6 C     B         0
#7 A     C         0
#8 B     C         0
#9 C     C         3
With the updated data in the OP's post:
data %>%
  group_by(col1, col2) %>%
  summarize(stat = sum(val2) - sum(val1)) %>%
  spread(col2, stat, fill = 0) %>%
  gather(col2, stat, -1)
# A tibble: 9 x 3
# Groups:   col1 [3]
#  col1  col2    stat
#  <fct> <chr>  <dbl>
#1 A     A       7.76
#2 B     A     -20.8
#3 C     A       6.97
#4 A     B       0
#5 B     B      28.8
#6 C     B       0
#7 A     C       0
#8 B     C       0
#9 C     C       9.56
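In more recent tidyr (1.0.0 and later, if memory serves), spread and gather are superseded by pivot_wider and pivot_longer; a hedged sketch of the same wide-then-long round trip, assuming such a tidyr version is available:

# Sketch assuming a recent tidyr, where pivot_wider()/pivot_longer() supersede spread()/gather()
data %>%
  group_by(col1, col2) %>%
  summarize(stat = sum(val2) - sum(val1)) %>%
  pivot_wider(names_from = col2, values_from = stat, values_fill = 0) %>%
  pivot_longer(-col1, names_to = "col2", values_to = "stat")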
Upvotes: 7
Reputation: 26353
This is doable even without dplyr
as.data.frame(table(data$col1, data$col2, dnn = c("col1", "col2")))
#  col1 col2 Freq
#1    A    A    4
#2    B    A    1
#3    C    A    1
#4    A    B    0
#5    B    B    3
#6    C    B    0
#7    A    C    0
#8    B    C    0
#9    C    C    3
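If the statistic from the edited question is also needed without dplyr, a base-R sketch along the same lines; the aggregate()/merge() combination below is my own illustration, not part of the original answer:

# Base-R sketch: per-pair sum(val2) - sum(val1), with missing pairs filled in as 0
agg <- aggregate(cbind(val1, val2) ~ col1 + col2, data = data, FUN = sum)
agg$stat <- agg$val2 - agg$val1
all_pairs <- expand.grid(col1 = levels(data$col1), col2 = levels(data$col2))
out <- merge(all_pairs, agg[c("col1", "col2", "stat")], all.x = TRUE)
out$stat[is.na(out$stat)] <- 0
out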
Upvotes: 4