E.Rae
E.Rae

Reputation: 1

Trouble subsetting multiple data frames by corresponding lists

I am trying to extract regions from a chromosome/position data frame that correspond with a related list of chromosomal positions depicting histone locations. My current "pipeline":

import original dataset containing chromosome and position:

> mapinfo<- read.table()
> colnames(mapinfo)<-c("CHR","M")
> mapinfo$CHR<- paste("chr",mapinfo$CHR),sep="")
> head(mapinfo)
CHR         M
1  chrX  24072640
2  chr9 131463936
3 chr14 105176736
4 chr13 115000168
5  chr8  74791285
6 chr19   3676340

import bed file containing histone locations

> bed<-read.table()
> names(bed)<- c("Chr","Start","Stop")
> head(bed)
Chr     Start      Stop
1  chr4  76806896  76807598
2  chrY  10034763  10036639
3  chr2 133036421 133037716
4 chr21  27227897  27228500
5  chr1 145036931 145041607
6  chr2  91777964  91779762

Generate chromosome specific codes for use in subsetting from mapinfo data frame

> Mcodes<- by(bed,bed$Chr,function(x){paste("M>=",bed$Start,"&M<=",bed$Stop,sep="",collapse="|")})
> Mcodes[chr1]
chr1
"M>=130786932&M<=130787255|M>=133156512&M<=133156894..."

Subset original mapinfo dataset by chromosome:

> subs<- split(mapinfo,mapinfo$CHR)

At this point I can use the following line in order to subset my desired regions individually by chromosome:

> CHR1<- eval(parse(text=paste0('subset(subs$chr1,',Mcodes["chr1"],')')))

I would like to subset all chromosome specific data frames contained in "subs" by their chromosome specific corresponding list in "Mcodes" without having to run that last line of code 24 different times (as I have multiple bed files for various histones/histone variants that will eventually need to be put through the same pipeline). Is there a way to loop/apply/something to make that possible?

Sorry if it seems a trivial question - I am still very new to the R/programming game. Thanks for any advice.

Upvotes: 0

Views: 63

Answers (1)

Simon Jackson
Simon Jackson

Reputation: 3174

Here's a toy example that, if I've understood you correctly, should help:

Here's a toy mapinfo:

mapinfo <- data.frame(
  CHR = paste0("chr", rep(1:3, each = 3)),
  M   = c(1, 10, 100)
)
mapinfo
#>    CHR   M
#> 1 chr1   1
#> 2 chr1  10
#> 3 chr1 100
#> 4 chr2   1
#> 5 chr2  10
#> 6 chr2 100
#> 7 chr3   1
#> 8 chr3  10
#> 9 chr3 100

A toy bed:

bed <- data.frame(
  CHR   = paste0("chr", rep(1:3, each = 2)),
  Start = c(0, 5, 15),
  Stop  = c(5, 15, 120)
)
bed
#>    CHR Start Stop
#> 1 chr1     0    5
#> 2 chr1     5   15
#> 3 chr2    15  120
#> 4 chr2     0    5
#> 5 chr3     5   15
#> 6 chr3    15  120

Then, if you don't have them, install the packages dplyr and purrr (using install.packages(c("dplyr", "purrr")), and run the following:

library(dplyr)
library(purrr)

mapinfo %>%
  left_join(bed) %>%
  filter(pmap_lgl(list(M, Start, Stop), between))
#>    CHR   M Start Stop
#> 1 chr1   1     0    5
#> 2 chr1  10     5   15
#> 3 chr2   1     0    5
#> 4 chr2 100    15  120
#> 5 chr3  10     5   15
#> 6 chr3 100    15  120

This is a data frame of all elements in mapinfo that were between Start and Stop values specified in bed. At this point, you can go ahead and split it up as desired, remove Start and Stop columns, etc.

An important note: CHR must have exactly the same label in both data frames (e.g., all caps). I noticed that you had in all caps in one and just first-letter caps in another. Be consistent here (a) for this code to work and (b) because consistency makes your code easier to digest.

To help you develop your understanding of the code, here are some relevant links to "R for Data Science" by Garrett Grolemund and Hadley Wickham:

Upvotes: 1

Related Questions