luca tucciarone

Reputation: 111

Counting rows of Dataframe filtering by combinations of levels

I have this data frame, which is the output of multibedintersect run on 8 different BED files from my ChIP-seq data:

    head(Table)
    chrom   start     end num  list
2   chr1 4491607 4493602   2   6,7
6   chr1 4571540 4571826   2   7,8
15  chr1 5019126 5020672   2   2,7
21  chr1 7139275 7139745   3 4,6,7
23  chr1 7398185 7398658   2   7,8
28  chr1 9745462 9745912   4 1,4,6,7

The column "list" is a comma-separated character string naming the samples in which that particular peak is present.

For example, the peak in row "2" is found in both sample 6 and sample 7.

I want to count how many times every combination of 2 samples is found in the dataset, and create a table that summarises this information.

Basically, multibedintersect reports more overlaps than I need: I'm only interested in how the samples overlap with each other two at a time.

For example, samples 6 and 7 are both found in peaks 2, 21 and 28, and samples 4 and 6 are both found in peaks 21 and 28.

With the tidyverse package, I can do this for one combination at a time, but I'm not able to "make it cycle" over every combination.

    Table %>%
      filter(str_detect(list, "6,7"))

In this way I get back anything that has that combination:

   chrom   start     end num  list
2   chr1 4491607 4493602   2   6,7
21  chr1 7139275 7139745   3 4,6,7
28  chr1 9745462 9745912   4 1,4,6,7

I think this is inefficient and very script-intensive, as I would need to filter manually for every combination.

Doing this "my way" would be something horrible like this:

Counts <- NULL
Pippo <- Table %>%
  filter(str_detect(list, "7,8"))
Counts <- cbind(nrow(Pippo))

Pippo <- Table %>%
  filter(str_detect(list, "6,8"))
Counts <- cbind(Counts, nrow(Pippo))

Pippo <- Table %>%
  filter(str_detect(list, "5,8"))
Counts <- cbind(Counts, nrow(Pippo))

Pippo <- Table %>%
  filter(str_detect(list, "4,8"))
Counts <- cbind(Counts, nrow(Pippo))

Pippo <- Table %>%
  filter(str_detect(list, "3,8"))
Counts <- cbind(Counts, nrow(Pippo))

Pippo <- Table %>%
  filter(str_detect(list, "2,8"))
Counts <- cbind(Counts, nrow(Pippo))

Pippo <- Table %>%
  filter(str_detect(list, "1,8"))
Counts <- cbind(Counts, nrow(Pippo))

Could you please suggest a better way to count every combination and create this summary data frame?

Thanks a lot

Upvotes: 1

Views: 69

Answers (1)

Parfait

Reputation: 107652

Consider base R with two sapply calls: one with combn to build all pair strings and then another with grepl for subsetting data frame to retrieve row counts:

pairs <- sapply(combn(1:8, 2, simplify=FALSE), function(i) paste(i, collapse=","))

Counts <- sapply(pairs, function(i) nrow(subset(Table, grepl(i, `list`))))

Counts
# 1,2 1,3 1,4 1,5 1,6 1,7 1,8 2,3 2,4 2,5 2,6 2,7 2,8 3,4 3,5 3,6 3,7 3,8 4,5 4,6 
#   0   0   1   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   2 
# 4,7 4,8 5,6 5,7 5,8 6,7 6,8 7,8 
#   0   0   0   0   0   3   0   2 
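
Since the question asks for a summary data frame rather than a named vector, the counts can be reshaped afterwards. The following is only a minimal sketch: it rebuilds the six example rows from the question so that it runs on its own, and the names `Summary`, `pair` and `count` are my own choices, not something from the original data.

```r
# Rebuild the six example rows from the question (only the columns needed here).
Table <- data.frame(
  chrom = "chr1",
  num   = c(2, 2, 2, 3, 2, 4),
  list  = c("6,7", "7,8", "2,7", "4,6,7", "7,8", "1,4,6,7"),
  stringsAsFactors = FALSE
)

# All 28 pair strings for samples 1..8, e.g. "1,2", "1,3", ..., "7,8".
pairs <- sapply(combn(1:8, 2, simplify = FALSE),
                function(i) paste(i, collapse = ","))

# Number of rows whose `list` string contains each pair as a literal substring.
Counts <- sapply(pairs, function(p) sum(grepl(p, Table$list, fixed = TRUE)))

# One row per pair; sapply() keeps the pair strings as names of Counts.
# `Summary`, `pair` and `count` are illustrative names.
Summary <- data.frame(pair = names(Counts), count = unname(Counts),
                      stringsAsFactors = FALSE)
```

`Summary[Summary$count > 0, ]` would then keep only the pairs that actually occur.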

Alternatively, with a tidy version (dplyr + purrr + stringr):

pairs <- combn(1:8, 2, simplify=FALSE) %>% 
  map(~(paste(., collapse=","))) %>%
  unlist()

Counts <- pairs %>% 
  map(~(filter(Table, str_detect(list, .)) %>% nrow)) %>%
  setNames(pairs) %>%
  unlist()

Counts
# 1,2 1,3 1,4 1,5 1,6 1,7 1,8 2,3 2,4 2,5 2,6 2,7 2,8 3,4 3,5 3,6 3,7 3,8 4,5 4,6 
#   0   0   1   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   2 
# 4,7 4,8 5,6 5,7 5,8 6,7 6,8 7,8 
#   0   0   0   0   0   3   0   2
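
One caveat worth checking against what you need: both versions above match the pair as a literal substring, so a pair is only counted when the two sample numbers sit next to each other in `list`. Row 21 ("4,6,7") contains samples 4 and 7, yet "4,7" is counted as 0. If you want every co-occurrence regardless of position, a possible sketch is to split the string first and test membership. This is an assumption about the intended result, and `samples`, `all_pairs` and `CountsAny` are illustrative names; the example rows are rebuilt from the question so the block runs on its own.

```r
# The `list` column of the six example rows from the question.
Table <- data.frame(
  list = c("6,7", "7,8", "2,7", "4,6,7", "7,8", "1,4,6,7"),
  stringsAsFactors = FALSE
)

# Split each string into its sample numbers, e.g. "4,6,7" -> c(4, 6, 7).
samples <- lapply(strsplit(Table$list, ",", fixed = TRUE), as.integer)

# All 28 pairs of samples 1..8, kept as integer vectors of length 2.
all_pairs <- combn(1:8, 2, simplify = FALSE)

# A row counts for a pair when both samples appear anywhere in its list.
CountsAny <- sapply(all_pairs, function(p) {
  sum(vapply(samples, function(s) all(p %in% s), logical(1)))
})
names(CountsAny) <- sapply(all_pairs, paste, collapse = ",")

# Show only the pairs that occur at least once.
CountsAny[CountsAny > 0]
```

Splitting into integers also avoids partial matches such as "1,2" inside "11,2", which would matter if there were ever 10 or more samples.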

Upvotes: 1
