Reputation: 59
I have the following dataset, showing the INGREDIENTS contained in each PRODUCT:
data <- data.frame("PRODUCT" = c("Creme","Creme","Creme","Creme","Medoc","Medoc","Medoc","Medoc","Medoc","Hububu","Hububu","Hububu","Hububu","Troll","Troll","Troll","Troll","Suzuki","Suzuki","Gluglu","Gluglu","Gluglu"),
"INGREDIENT" = c("zeze","zaza","zozo","zuzu","zaza","sasa","haha","zuzu","zemzem","zaza","zuzu","zizi","haha","zozo","zaza","zemzem","zuzu","sasa","zuzu","ozam","zaza","hayda"))
I want to know the most common combinations of INGREDIENTS across PRODUCTS, i.e. which ingredient is associated with which other ingredient. I applied the code I found in another thread:
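library(dplyr)

combinaisons_par_PRODUCT <- data %>%
  # self-join: pair every ingredient with every ingredient of the same product
  full_join(data, by = "PRODUCT") %>%
  group_by(INGREDIENT.x, INGREDIENT.y) %>%
  # number of distinct products containing both ingredients
  summarise(n = length(unique(PRODUCT))) %>%
  filter(INGREDIENT.x != INGREDIENT.y) %>%
  mutate(item = paste(INGREDIENT.x, INGREDIENT.y, sep = ", "))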
It works, but there is one final flaw: I would like the order to be ignored. For instance, this code gives me 1 association of HAHA and SASA, and also 1 association of SASA and HAHA. For me, these are the same thing, so I would like the code to ignore the order of the INGREDIENTS and give me a single association for the pair HAHA & SASA.
I tried sorting the INGREDIENTS before applying the code, but that didn't work either. Could someone help me, please? How can I get these combinations regardless of order?
Thank you very much!
Upvotes: 1
Views: 114
Reputation: 101044
An igraph option using graph_from_adjacency_matrix:
library(igraph)

get.data.frame(
  graph_from_adjacency_matrix(
    crossprod(table(data)),
    mode = "undirected",
    weighted = TRUE
  )
)
gives
from to weight
1 haha haha 2
2 haha sasa 1
3 haha zaza 2
4 haha zemzem 1
5 haha zizi 1
6 haha zuzu 2
7 hayda hayda 1
8 hayda ozam 1
9 hayda zaza 1
10 ozam ozam 1
11 ozam zaza 1
12 sasa sasa 2
13 sasa zaza 1
14 sasa zemzem 1
15 sasa zuzu 2
16 zaza zaza 5
17 zaza zemzem 2
18 zaza zeze 1
19 zaza zizi 1
20 zaza zozo 2
21 zaza zuzu 4
22 zemzem zemzem 2
23 zemzem zozo 1
24 zemzem zuzu 2
25 zeze zeze 1
26 zeze zozo 1
27 zeze zuzu 1
28 zizi zizi 1
29 zizi zuzu 1
30 zozo zozo 2
31 zozo zuzu 2
32 zuzu zuzu 5
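The rows where `from` and `to` are the same ingredient (for example `haha haha 2`) come from the diagonal of the co-occurrence matrix and only count how many products contain that ingredient. If only true pairs are wanted, one possible sketch, assuming the `data` frame from the question, is to pass `diag = FALSE`. `igraph::as_data_frame()` is used here in place of `get.data.frame()`; the two should be equivalent for this purpose.

```r
library(igraph)

# Sample data from the question
data <- data.frame(
  PRODUCT = c("Creme","Creme","Creme","Creme","Medoc","Medoc","Medoc","Medoc","Medoc",
              "Hububu","Hububu","Hububu","Hububu","Troll","Troll","Troll","Troll",
              "Suzuki","Suzuki","Gluglu","Gluglu","Gluglu"),
  INGREDIENT = c("zeze","zaza","zozo","zuzu","zaza","sasa","haha","zuzu","zemzem",
                 "zaza","zuzu","zizi","haha","zozo","zaza","zemzem","zuzu",
                 "sasa","zuzu","ozam","zaza","hayda"),
  stringsAsFactors = FALSE
)

# Ingredient x ingredient matrix: cell [i, j] = number of products with both
co <- crossprod(table(data))

# diag = FALSE ignores the diagonal, so no self-loops are created
g <- graph_from_adjacency_matrix(co, mode = "undirected",
                                 weighted = TRUE, diag = FALSE)

edges <- igraph::as_data_frame(g, what = "edges")
edges[order(-edges$weight), ]
```

Sorting by `weight` puts the most frequent pairs first.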
Upvotes: 1
Reputation: 886938
We could use base R
m1 <- crossprod(table(data))
subset(as.data.frame.table(m1 * lower.tri(m1, diag = TRUE)), Freq != 0)
EDIT: Comments from @ThomasIsCoding
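With `diag = TRUE` the result also contains each ingredient paired with itself. A variant that keeps only distinct pairs might look like the sketch below; it assumes the `data` frame from the question, and the column names `a`, `b` and `n` are made up for illustration.

```r
# Sample data from the question
data <- data.frame(
  PRODUCT = c("Creme","Creme","Creme","Creme","Medoc","Medoc","Medoc","Medoc","Medoc",
              "Hububu","Hububu","Hububu","Hububu","Troll","Troll","Troll","Troll",
              "Suzuki","Suzuki","Gluglu","Gluglu","Gluglu"),
  INGREDIENT = c("zeze","zaza","zozo","zuzu","zaza","sasa","haha","zuzu","zemzem",
                 "zaza","zuzu","zizi","haha","zozo","zaza","zemzem","zuzu",
                 "sasa","zuzu","ozam","zaza","hayda"),
  stringsAsFactors = FALSE
)

# Ingredient x ingredient co-occurrence counts
m1 <- crossprod(table(data$PRODUCT, data$INGREDIENT))

# Strict lower triangle (no diagonal) with a non-zero count:
# each unordered pair appears exactly once
idx <- which(lower.tri(m1) & m1 > 0, arr.ind = TRUE)

pairs <- data.frame(
  a = colnames(m1)[idx[, "col"]],  # alphabetically first ingredient
  b = rownames(m1)[idx[, "row"]],  # alphabetically second ingredient
  n = m1[idx],                     # number of products containing both
  stringsAsFactors = FALSE
)

pairs[order(-pairs$n), ]
```

On the sample data this should leave 22 pairs, with `zaza` and `zuzu` the most frequent combination.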
Upvotes: 0
Reputation: 66415
Does this do what you want? I keep only the rows where the two ingredients are in alphabetical order, which drops both the reversed duplicates and the self-pairs, so each combination is counted once.
library(dplyr)

data %>%
  full_join(data, by = "PRODUCT") %>%
  # keep each unordered pair once: x must sort before y
  filter(INGREDIENT.x < INGREDIENT.y) %>%
  count(combo = paste(INGREDIENT.x, INGREDIENT.y, sep = ", "))
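The same idea can also be written as an order-insensitive key: sort the two ingredients within each row, then count. The sketch below is one way to do that in base R, assuming the `data` frame from the question; the column names `first`, `second` and `n` are illustrative only.

```r
# Sample data from the question
data <- data.frame(
  PRODUCT = c("Creme","Creme","Creme","Creme","Medoc","Medoc","Medoc","Medoc","Medoc",
              "Hububu","Hububu","Hububu","Hububu","Troll","Troll","Troll","Troll",
              "Suzuki","Suzuki","Gluglu","Gluglu","Gluglu"),
  INGREDIENT = c("zeze","zaza","zozo","zuzu","zaza","sasa","haha","zuzu","zemzem",
                 "zaza","zuzu","zizi","haha","zozo","zaza","zemzem","zuzu",
                 "sasa","zuzu","ozam","zaza","hayda"),
  stringsAsFactors = FALSE
)

# Self-join on PRODUCT; merge() names the columns INGREDIENT.x / INGREDIENT.y
joined <- merge(data, data, by = "PRODUCT")
joined <- joined[joined$INGREDIENT.x != joined$INGREDIENT.y, ]

# Row-wise sort of the pair: (sasa, haha) and (haha, sasa) both become (haha, sasa)
joined$first  <- pmin(joined$INGREDIENT.x, joined$INGREDIENT.y)
joined$second <- pmax(joined$INGREDIENT.x, joined$INGREDIENT.y)

# Each product now holds every pair twice; keep one copy, then count products
dedup  <- unique(joined[, c("PRODUCT", "first", "second")])
counts <- aggregate(PRODUCT ~ first + second, data = dedup, FUN = length)
names(counts)[3] <- "n"

counts[order(-counts$n), ]
```

Sorting the whole INGREDIENT column beforehand does not help, because the order has to be fixed within each pair; `pmin()` and `pmax()` do that row by row.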
Upvotes: 2