Reputation: 395
I have a data frame (df), who looks like this:
V1 V2
a1 a4
a1 a5
a2 a4
a2 a6
a3 a4
a4 a5
a2 a5
For each pair of unique values in V1, I want to calculate the number of values they share in V2.
For example, consider a1
and a2
in V1. They both have a4
and a5
in V2. Thus, for the a1
-a2
pair, the result is 2
shared observations (in column W3
). And so on for all other pairs which share values:
W1 W2 W3
a1 a2 2
a1 a3 1
a1 a4 1
a1 a3 1
a2 a3 1
Some idea how can I do it?
Upvotes: 2
Views: 92
Reputation: 67778
tcrossprod(table(d))
# V1
# V1 a1 a2 a3 a4
# a1 2 2 1 1
# a2 2 3 1 1
# a3 1 1 1 0
# a4 1 1 0 1
If desired, grab the upper (or lower) triangle, e.g.
ix = which(upper.tri(m), arr.ind = TRUE)
data.frame(v1 = dimnames(m)[[1]][ix[, 1]], v2 = dimnames(m)[[2]][ix[, 2]], n = m[ix])
Or treat as a graph problem:
library(igraph)
g = graph_from_data_frame(d)
bibcoupling(g)
# a1 a2 a3 a4 a5 a6
# a1 0 2 1 1 0 0
# a2 2 0 1 1 0 0
# a3 1 1 0 0 0 0
# a4 1 1 0 0 0 0
# a5 0 0 0 0 0 0
# a6 0 0 0 0 0 0
plot(g)
Upvotes: 3
Reputation: 886948
Here is one option with combn
. We split
the 'V2' by 'V1' column, then get the combn
atios of pairwise elements of list
to return the length
of intersect
ing elements, name
the list
elements with the pairwise concatenation of names of 'lst1', stack
it to a two column data.frame, separate
the concatenated column into two and rename
if needed
lst1 <- split(df1$V2, df1$V1)
outlst <- combn(lst1, 2, FUN = function(x)
length(Reduce(intersect, x)), simplify = FALSE)
names(outlst) <- combn(names(lst1), 2, FUN = paste, collapse = "_")
library(dplyr)
library(tidyr)
subset(stack(outlst), values >0)[2:1] %>%
separate(ind, into = c('W1', 'W2')) %>%
rename(W3 = values)
# W1 W2 W3
#1 a1 a2 2
#2 a1 a3 1
#3 a1 a4 1
#4 a2 a3 1
#5 a2 a4 1
Or using tidyverse
library(dplyr)
library(tidyr)
df1 %>%
group_by(V2) %>%
summarise(V1 = if(n() > 1) combn(V1, 2, FUN = paste, collapse='_')
else NA_character_) %>%
ungroup %>%
filter(!is.na(V1)) %>%
separate(V1, into = c('W1', 'W2')) %>%
count(W1, W2)
# A tibble: 5 x 3
# W1 W2 n
# <chr> <chr> <int>
#1 a1 a2 2
#2 a1 a3 1
#3 a1 a4 1
#4 a2 a3 1
#5 a4 a2 1
df1 <- structure(list(V1 = c("a1", "a1", "a2", "a2", "a3", "a4", "a2"
), V2 = c("a4", "a5", "a4", "a6", "a4", "a5", "a5")), class = "data.frame",
row.names = c(NA,
-7L))
Upvotes: 1
Reputation: 1713
Just join df
onto itself by V2 column, count the cooccurrence of V1 items and filter the result, so that W1 < W2, since it is symmetrical anyway...
library(solar)
library(stringr)
result <- df %>%
left_join(df, by = "V2") %>%
count(W1 = V1.x, W2 = V1.y, name = "W3") %>%
filter(W1 < W2)
Upvotes: 3
Reputation: 79198
In base R you could do:
a <- table(unlist(aggregate(.~V2, df,
function(x)if(length(x)>1) combn(x, 2, paste, collapse = " "))[,2]))
read.table(text=paste(names(a),a))
V1 V2 V3
1 a1 a2 2
2 a1 a3 1
3 a1 a4 1
4 a2 a3 1
5 a4 a2 1
Upvotes: 1