Bruno Guarita
Bruno Guarita

Reputation: 817

Find number of combinations

I have a dataset where each row corresponds to a sample that was tested for the existence of specific drugs (one sample can have more than one drug). I am trying to find the most common drug combinations and I wanted to know if there is a better way to do it. This is an example of my dataset:

    id = c(id1,id2,id3,id4,id5,id6,id7,id8)
    d1 = c(1,1,0,1,0,1,0,1)
    d2 = c(0,0,1,0,1,1,1,0)
    d3 = c(1,0,1,1,0,1,0,1)

    df = tibble(id, d1, d2, d3)

column id corresponds to the id of the sample and the other columns are the drugs for which each sample was tested (in the original dataset I have 42 drugs/columns). 1 means Yes, 0 means No.

In order to get the number of combinations I did the following:

df %>% unite("tot", d1:d3, sep = "-", remove = F) %>%
  group_by(tot) %>% summarise(n = n())

# A tibble: 5 x 2
  tot       n
  <chr> <int>
1 0-1-0     2
2 0-1-1     1
3 1-0-0     1
4 1-0-1     3
5 1-1-1     1

Ok, now I know that combination 1-0-1 (d1 + d3) is the most common. That is relatively simple, taken into account that in the example I only have 3 drugs. The problem is when I do it for the 42 drugs and I end up with a huge string that I need to translate back.

Is there a more efficient way to do this? Thanks!

Upvotes: 1

Views: 41

Answers (2)

Yuriy Saraykin
Yuriy Saraykin

Reputation: 8880

additional option

df %>% 
  pivot_longer(-id) %>% 
  filter(value != 0) %>% 
  group_by(id) %>% 
  summarise(name = str_c(name, collapse = ", ")) %>% 
  group_by(name) %>% 
  count() %>% 
  arrange(-n)

Upvotes: 1

tmfmnk
tmfmnk

Reputation: 39858

Using dplyr, you can do:

df %>%
 group_by_at(vars(-id)) %>%
 count()

     d1    d2    d3     n
  <dbl> <dbl> <dbl> <int>
1     0     1     0     2
2     0     1     1     1
3     1     0     0     1
4     1     0     1     3
5     1     1     1     1

If you want the names of columns with ones from the n (here two) most frequent combinations, with the addition of tidyr:

df %>%
 group_by_at(vars(-id)) %>%
 count() %>%
 ungroup() %>%
 top_n(2, wt = n) %>%
 rowid_to_column() %>%
 pivot_longer(-c(rowid, n)) %>%
 group_by(rowid, n) %>%
 summarise(name = paste(name[value == 1], collapse = ", "))

  rowid     n name  
  <int> <int> <chr> 
1     1     2 d2    
2     2     3 d1, d3

Upvotes: 3

Related Questions