Reputation: 606
I have a dataframe like so:
df<-structure(list(id = c("A", "A", "A", "B", "B", "C", "C", "D",
"D", "E", "E"), expertise = c("r", "python", "julia", "python",
"r", "python", "julia", "python", "julia", "r", "julia")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -11L), .Names = c("id",
"expertise"), spec = structure(list(cols = structure(list(id = structure(list(), class = c("collector_character",
"collector")), expertise = structure(list(), class = c("collector_character",
"collector"))), .Names = c("id", "expertise")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
df
id expertise
1 A r
2 A python
3 A julia
4 B python
5 B r
6 C python
7 C julia
8 D python
9 D julia
10 E r
11 E julia
I can get the overall counts of "expertise" by using:
library(dplyr)
df %>% group_by(expertise) %>% mutate (counts_overall= n())
However what I want is the counts for combinations of expertise values. In other words how many "id" had the same combination of two expertise e.g. "r" and"julia"? Here is a desired output:
df_out<-structure(list(expertise1 = c("r", "r", "python"), expertise2 = c("python",
"julia", "julia"), count = c(2L, 2L, 3L)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -3L), .Names = c("expertise1",
"expertise2", "count"), spec = structure(list(cols = structure(list(
expertise1 = structure(list(), class = c("collector_character",
"collector")), expertise2 = structure(list(), class = c("collector_character",
"collector")), count = structure(list(), class = c("collector_integer",
"collector"))), .Names = c("expertise1", "expertise2", "count"
)), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
df_out
expertise1 expertise2 count
1 r python 2
2 r julia 2
3 python julia 3
Upvotes: 2
Views: 289
Reputation: 42544
The linked answer from latemail's comment creates a matrix
crossprod(table(df) > 0)
expertise expertise julia python r julia 4 3 2 python 3 4 2 r 2 2 3
while the OP expects a dataframe in long format.
Below is a data.table
solution which uses the CJ()
(cross join) function:
library(data.table)
setDT(df)[, CJ(expertise, expertise)[V1 < V2], by = id][
, .N, by = .(expertise1 = V1, expertise2 = V2)]
expertise1 expertise2 N 1: julia python 3 2: julia r 2 3: python r 2
CJ(expertise, expertise)[V1 < V2]
is the data.table
equivalent for t(combn(df$expertise, 2))
or combinat::combn2(df$expertise)
.
Here is another variant which uses a self-join:
library(data.table)
setDT(df)[df, on = "id", allow = TRUE][
expertise < i.expertise, .N, by = .(expertise1 = expertise, expertise2 = i.expertise)]
expertise1 expertise2 N 1: python r 2 2: julia r 2 3: julia python 3
Upvotes: 4
Reputation: 17289
A solution not as efficient as crossprod-table approach but easy to understand:
library(tidyr)
df %>% group_by(id) %>%
summarize(expertise = list(combn(sort(expertise), 2, FUN = paste, collapse = '_'))) %>%
unnest(expertise) %>%
group_by(expertise) %>%
summarize(count = n()) %>%
separate(expertise, c('expertise1', 'expertise2'), sep = '_')
# # A tibble: 3 x 3
# expertise1 expertise2 count
# <chr> <chr> <int>
# 1 julia python 3
# 2 julia r 2
# 3 python r 2
Upvotes: 2