Daniel
Daniel

Reputation: 395

Count shared values

Problem

I have a data frame (df), who looks like this:

V1      V2
a1      a4
a1      a5
a2      a4
a2      a6
a3      a4
a4      a5 
a2      a5

For each pair of unique values in V1, I want to calculate the number of values they share in V2.

For example, consider a1 and a2 in V1. They both have a4 and a5 in V2. Thus, for the a1-a2 pair, the result is 2 shared observations (in column W3). And so on for all other pairs which share values:

W1    W2    W3
a1    a2    2
a1    a3    1
a1    a4    1
a1    a3    1
a2    a3    1

Some idea how can I do it?

Upvotes: 2

Views: 92

Answers (4)

Henrik
Henrik

Reputation: 67778

tcrossprod(table(d))
#    V1
# V1   a1 a2 a3 a4
#   a1  2  2  1  1
#   a2  2  3  1  1
#   a3  1  1  1  0
#   a4  1  1  0  1

If desired, grab the upper (or lower) triangle, e.g.

ix = which(upper.tri(m), arr.ind = TRUE)
data.frame(v1 = dimnames(m)[[1]][ix[, 1]], v2 = dimnames(m)[[2]][ix[, 2]], n = m[ix])

Or treat as a graph problem:

library(igraph)
g = graph_from_data_frame(d)
bibcoupling(g)
#    a1 a2 a3 a4 a5 a6
# a1  0  2  1  1  0  0
# a2  2  0  1  1  0  0
# a3  1  1  0  0  0  0
# a4  1  1  0  0  0  0
# a5  0  0  0  0  0  0
# a6  0  0  0  0  0  0


plot(g)

enter image description here

Upvotes: 3

akrun
akrun

Reputation: 886948

Here is one option with combn. We split the 'V2' by 'V1' column, then get the combnatios of pairwise elements of list to return the length of intersecting elements, name the list elements with the pairwise concatenation of names of 'lst1', stack it to a two column data.frame, separate the concatenated column into two and rename if needed

lst1 <- split(df1$V2, df1$V1)
outlst <- combn(lst1, 2, FUN = function(x) 
         length(Reduce(intersect, x)), simplify = FALSE) 
names(outlst) <- combn(names(lst1), 2, FUN = paste, collapse = "_")

library(dplyr)
library(tidyr)
subset(stack(outlst), values >0)[2:1] %>% 
    separate(ind, into = c('W1', 'W2')) %>% 
    rename(W3 = values)
#  W1 W2 W3
#1 a1 a2  2
#2 a1 a3  1
#3 a1 a4  1
#4 a2 a3  1
#5 a2 a4  1

Or using tidyverse

library(dplyr)
library(tidyr)
df1 %>%
   group_by(V2) %>%
   summarise(V1 = if(n() > 1) combn(V1, 2, FUN = paste, collapse='_') 
         else NA_character_) %>%
   ungroup %>%
   filter(!is.na(V1)) %>%
   separate(V1, into = c('W1', 'W2')) %>% 
   count(W1, W2)
# A tibble: 5 x 3
#  W1    W2        n
#  <chr> <chr> <int>
#1 a1    a2        2
#2 a1    a3        1
#3 a1    a4        1
#4 a2    a3        1
#5 a4    a2        1

data

df1 <- structure(list(V1 = c("a1", "a1", "a2", "a2", "a3", "a4", "a2"
), V2 = c("a4", "a5", "a4", "a6", "a4", "a5", "a5")), class = "data.frame", 
row.names = c(NA, 
-7L))

Upvotes: 1

alex_jwb90
alex_jwb90

Reputation: 1713

Just join df onto itself by V2 column, count the cooccurrence of V1 items and filter the result, so that W1 < W2, since it is symmetrical anyway...

library(solar)
library(stringr)

result <- df %>%
  left_join(df, by = "V2") %>%
  count(W1 = V1.x, W2 = V1.y, name = "W3") %>%
  filter(W1 < W2)

Upvotes: 3

Onyambu
Onyambu

Reputation: 79198

In base R you could do:

a <- table(unlist(aggregate(.~V2, df, 
         function(x)if(length(x)>1) combn(x, 2, paste, collapse = " "))[,2]))
read.table(text=paste(names(a),a))
  V1 V2 V3
1 a1 a2  2
2 a1 a3  1
3 a1 a4  1
4 a2 a3  1
5 a4 a2  1

Upvotes: 1

Related Questions