user964689

Reputation: 822

Count number of shared observations between samples using dplyr

I have a list of observations grouped by sample, and I want to find the samples that share the most identical observations. Two observations are identical when both the start and the end values match between two samples.

I'd like to do this in R, preferably with dplyr. I've been getting used to dplyr for simpler data handling, but this task is beyond what I can currently do. I think the solution would involve grouping start and end into a single key, e.g. group_by(start, end), but I also need to keep track of which sample each observation belongs to so that I can compare between samples.

example:

sample  start   end
a   2   4
a   3   6
a   4   8
b   2   4
b   3   6
b   10  12
c   10  12
c   0   4
c   2   4
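
For anyone who wants to run the answers below, the example table could be entered as a data frame along these lines. The name df is an assumption, chosen only because the answers refer to the data by that name, and the integer column types are likewise assumed.

```r
# Example data from the question.
# The object name `df` is an assumption that matches the answers.
df <- data.frame(
  sample = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
  start  = c(2L, 3L, 4L, 2L, 3L, 10L, 10L, 0L, 2L),
  end    = c(4L, 6L, 8L, 4L, 6L, 12L, 12L, 4L, 4L),
  stringsAsFactors = FALSE  # keep sample names as plain character strings
)
```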

Here, samples a, b and c share 1 observation (2 4); samples a and b share 2 observations (2 4 and 3 6); samples b and c share 2 observations (2 4 and 10 12); and samples a and c share 1 observation (2 4).

I'd like an output like:

abc 1
ab 2
bc 2
ac 1 

and also to see what the shared observations are if possible:

abc 2 4
ab 2 4 
ab 3 6

etc

Thanks in advance

Upvotes: 0

Views: 249

Answers (2)

Sotos

Reputation: 51592

Here is an idea via base R:

# Split by (start, end), drop the empty combinations, then count rows and join the sample names
groups <- Filter(nrow, split(df, list(df$start, df$end)))
final_d <- data.frame(count1 = sapply(groups, nrow),
                      pairs1 = sapply(groups, function(i) paste(i$sample, collapse = '')))

#      count1 pairs1
#0.4        1      c
#2.4        3    abc
#3.6        2     ab
#4.8        1      a
#10.12      2     bc
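
The table above lists, for each observation, which samples contain it. To get from there to the per-pair counts the question asks for, one possible base R route is a sample-by-sample co-occurrence matrix. This is only a sketch: the names tab, incidence and shared are made up for illustration, and df is assumed to hold the question's example data.

```r
# Assumed example data from the question
df <- data.frame(
  sample = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
  start  = c(2L, 3L, 4L, 2L, 3L, 10L, 10L, 0L, 2L),
  end    = c(4L, 6L, 8L, 4L, 6L, 12L, 12L, 4L, 4L),
  stringsAsFactors = FALSE
)

# Rows: distinct (start, end) observations; columns: samples
tab <- table(paste(df$start, df$end), df$sample)

# 0/1 incidence matrix: 1 if the sample contains the observation
incidence <- matrix(as.integer(tab > 0), nrow = nrow(tab),
                    dimnames = dimnames(tab))

# t(incidence) %*% incidence: entry [i, j] is the number of
# observations present in both sample i and sample j
shared <- crossprod(incidence)
shared
```

The off-diagonal entries are the pairwise shared counts; the diagonal is simply the number of distinct observations in each sample. This covers pairs only, so counts for three or more samples at once (such as abc) would need a different approach.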

Upvotes: 1

talat

Reputation: 70266

Here's something that should get you going:

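library(dplyr)

# For each distinct (start, end) pair, list the samples that contain it and count them
df %>% 
  group_by(start, end) %>% 
  summarise(
    samples = paste(unique(sample), collapse = ""), 
    n = length(unique(sample)))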

# Source: local data frame [5 x 4]
# Groups: start [?]
# 
#   start   end samples     n
#   <int> <int>   <chr> <int>
# 1     0     4       c     1
# 2     2     4     abc     3
# 3     3     6      ab     2
# 4     4     8       a     1
# 5    10    12      bc     2
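
As a possible next step toward the pairwise counts in the question, the data can be joined to itself on start and end. This is a sketch under assumptions rather than part of the original answer: obs, shared and pair_counts are illustrative names, df is assumed to hold the question's example data, and recent dplyr versions may warn about a many-to-many relationship on this kind of self-join.

```r
library(dplyr)

# Assumed example data from the question
df <- data.frame(
  sample = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
  start  = c(2L, 3L, 4L, 2L, 3L, 10L, 10L, 0L, 2L),
  end    = c(4L, 6L, 8L, 4L, 6L, 12L, 12L, 4L, 4L),
  stringsAsFactors = FALSE
)

# Drop duplicate rows within a sample so each observation counts once
obs <- distinct(df, sample, start, end)

# Self-join on (start, end): every row pairs two samples sharing an observation.
# Keeping sample_1 < sample_2 removes self-pairs and mirrored duplicates.
shared <- inner_join(obs, obs, by = c("start", "end"),
                     suffix = c("_1", "_2")) %>%
  filter(sample_1 < sample_2)

# Number of shared observations per pair of samples
pair_counts <- count(shared, sample_1, sample_2)
pair_counts
```

For the example data this gives 2 for a and b, 1 for a and c, and 2 for b and c, which matches the pairwise counts in the question. The shared data frame also lists which observations each pair has in common. Combinations of three or more samples are not covered by this sketch.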

Upvotes: 1
