Emman

Reputation: 4201

How to find intersection between all possible pairs of sets in a 2-column table?

I want to calculate an overlap coefficient between sets. My data comes as a 2-column table, such as:

df_example <- 
  tibble::tribble(~my_group, ~cities,
                   "foo",   "london",
                   "foo",   "paris", 
                   "foo",   "rome", 
                   "foo",   "tokyo",
                   "foo",   "oslo",
                   "bar",   "paris", 
                   "bar",   "nyc",
                   "bar",   "rome", 
                   "bar",   "munich",
                   "bar",   "warsaw",
                   "bar",   "sf", 
                   "baz",   "milano",
                   "baz",   "oslo",
                   "baz",   "sf",  
                   "baz",   "paris")

In df_example, I have 3 sets (i.e., foo, bar, baz), and members of each set are given in cities.

I would like to end up with a table that intersects all possible pairs of sets and gives, for each pair, the number of shared members and the size of the smaller set. From these two numbers I can calculate an overlap coefficient for each pair.

(Overlap coefficient = number of common members / size of smaller set)
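To make the formula concrete, here is a minimal sketch for a single pair (foo and bar), with the members copied by hand from df_example. The variable names are only for illustration:

```r
# Members of two of the sets, copied from df_example
foo <- c("london", "paris", "rome", "tokyo", "oslo")
bar <- c("paris", "nyc", "rome", "munich", "warsaw", "sf")

common <- intersect(foo, bar)               # members present in both sets
smaller <- min(length(foo), length(bar))    # size of the smaller set
overlap_coeff <- length(common) / smaller   # 2 / 5 = 0.4
```

The goal is to get the same three numbers for every pair of groups without writing the pairs out by hand.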

Desired Output

## # A tibble: 3 × 4
##   combination n_intersected_members size_of_smaller_set overlap_coeff
##   <chr>                       <dbl>               <dbl>         <dbl>
## 1 foo*bar                         2                   5           0.4
## 2 foo*baz                         2                   4           0.5
## 3 bar*baz                         2                   4           0.5

Is there a simple enough way to get this done with dplyr verbs? I've tried:

df_example |> 
  group_by(my_group) |> 
  summarise(intersected = dplyr::intersect(cities))

But this won't work, obviously, because dplyr::intersect() expects two vectors. Is there a way to get to the desired output similar to my dplyr direction?

Upvotes: 5

Views: 145

Answers (2)

alexis_laz

Reputation: 13122

Another way to organize the data would be to use a tabular form (we can use a sparse Matrix to save memory if needed):

library(Matrix)  # attach Matrix so colSums()/crossprod() dispatch on the sparse table
tab = xtabs( ~ cities + my_group, df_example, sparse = TRUE)

Then, all other variables can be calculated as:

n_intersected_members = crossprod(tab)
size_of_smaller_set = outer(cs <- colSums(tab), cs, pmin)
overlap_coeff = n_intersected_members / size_of_smaller_set
#overlap_coeff
#3 x 3 Matrix of class "dsyMatrix"
#    bar baz foo
#bar 1.0 0.5 0.4
#baz 0.5 1.0 0.5
#foo 0.4 0.5 1.0 

And retrieve the lower.tri of each object if needed.
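As a rough sketch of that last step, the lower triangles can be collected into one long table. To keep the example self-contained it assumes a dense base R table in place of the sparse one, and the names idx and pairs are only for illustration:

```r
df_example <- data.frame(
  my_group = rep(c("foo", "bar", "baz"), times = c(5, 6, 4)),
  cities = c("london", "paris", "rome", "tokyo", "oslo",
             "paris", "nyc", "rome", "munich", "warsaw", "sf",
             "milano", "oslo", "sf", "paris")
)

# Dense 0/1 membership matrix: rows are cities, columns are groups
# (assumption: dense here; the sparse version behaves the same way)
tab <- unclass(table(df_example$cities, df_example$my_group))

n_intersected_members <- crossprod(tab)
cs <- colSums(tab)
size_of_smaller_set <- outer(cs, cs, pmin)
overlap_coeff <- n_intersected_members / size_of_smaller_set

# Row/column positions of the lower triangle, i.e. one entry per unordered pair
idx <- which(lower.tri(overlap_coeff), arr.ind = TRUE)

pairs <- data.frame(
  combination = paste(colnames(overlap_coeff)[idx[, "col"]],
                      rownames(overlap_coeff)[idx[, "row"]], sep = "*"),
  n_intersected_members = n_intersected_members[idx],
  size_of_smaller_set = size_of_smaller_set[idx],
  overlap_coeff = overlap_coeff[idx]
)
pairs
```

Indexing each matrix with the same two-column idx keeps the three quantities aligned row by row.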

Upvotes: 2

ThomasIsCoding

Reputation: 101393

Here is a base R option using combn:

do.call(
    rbind,
    combn(
        with(
            df_example,
            split(cities, my_group)
        ),
        2,
        \(x)
        transform(
            data.frame(
                combo = paste0(names(x), collapse = "-"),
                nrIntersect = sum(x[[1]] %in% x[[2]]),
                szSmallSet = min(lengths(x))
            ),
            olCoeff = nrIntersect / szSmallSet
        ),
        simplify = FALSE
    )
)

which gives

    combo nrIntersect szSmallSet olCoeff
1 bar-baz           2          4     0.5
2 bar-foo           2          5     0.4
3 baz-foo           2          4     0.5

Upvotes: 4
