Reputation: 817
I have a dataset where each row corresponds to a sample that was tested for the existence of specific drugs (one sample can have more than one drug). I am trying to find the most common drug combinations and I wanted to know if there is a better way to do it. This is an example of my dataset:
id = c(id1,id2,id3,id4,id5,id6,id7,id8)
d1 = c(1,1,0,1,0,1,0,1)
d2 = c(0,0,1,0,1,1,1,0)
d3 = c(1,0,1,1,0,1,0,1)
df = tibble(id, d1, d2, d3)
column id
corresponds to the id of the sample and the other columns are the drugs for which each sample was tested (in the original dataset I have 42 drugs/columns). 1 means Yes, 0 means No.
In order to get the number of combinations I did the following:
df %>% unite("tot", d1:d3, sep = "-", remove = F) %>%
group_by(tot) %>% summarise(n = n())
# A tibble: 5 x 2
tot n
<chr> <int>
1 0-1-0 2
2 0-1-1 1
3 1-0-0 1
4 1-0-1 3
5 1-1-1 1
Ok, now I know that combination 1-0-1
(d1 + d3) is the most common. That is relatively simple, taken into account that in the example I only have 3 drugs. The problem is when I do it for the 42 drugs and I end up with a huge string that I need to translate back.
Is there a more efficient way to do this? Thanks!
Upvotes: 1
Views: 41
Reputation: 8880
additional option
df %>%
pivot_longer(-id) %>%
filter(value != 0) %>%
group_by(id) %>%
summarise(name = str_c(name, collapse = ", ")) %>%
group_by(name) %>%
count() %>%
arrange(-n)
Upvotes: 1
Reputation: 39858
Using dplyr
, you can do:
df %>%
group_by_at(vars(-id)) %>%
count()
d1 d2 d3 n
<dbl> <dbl> <dbl> <int>
1 0 1 0 2
2 0 1 1 1
3 1 0 0 1
4 1 0 1 3
5 1 1 1 1
If you want the names of columns with ones from the n (here two) most frequent combinations, with the addition of tidyr
:
df %>%
group_by_at(vars(-id)) %>%
count() %>%
ungroup() %>%
top_n(2, wt = n) %>%
rowid_to_column() %>%
pivot_longer(-c(rowid, n)) %>%
group_by(rowid, n) %>%
summarise(name = paste(name[value == 1], collapse = ", "))
rowid n name
<int> <int> <chr>
1 1 2 d2
2 2 3 d1, d3
Upvotes: 3