Reputation: 1
I am trying to extract regions from a chromosome/position data frame that correspond with a related list of chromosomal positions depicting histone locations. My current "pipeline":
> mapinfo<- read.table()
> colnames(mapinfo)<-c("CHR","M")
> mapinfo$CHR<- paste("chr",mapinfo$CHR),sep="")
> head(mapinfo)
CHR M
1 chrX 24072640
2 chr9 131463936
3 chr14 105176736
4 chr13 115000168
5 chr8 74791285
6 chr19 3676340
> bed<-read.table()
> names(bed)<- c("Chr","Start","Stop")
> head(bed)
Chr Start Stop
1 chr4 76806896 76807598
2 chrY 10034763 10036639
3 chr2 133036421 133037716
4 chr21 27227897 27228500
5 chr1 145036931 145041607
6 chr2 91777964 91779762
> Mcodes<- by(bed,bed$Chr,function(x){paste("M>=",bed$Start,"&M<=",bed$Stop,sep="",collapse="|")})
> Mcodes[chr1]
chr1
"M>=130786932&M<=130787255|M>=133156512&M<=133156894..."
> subs<- split(mapinfo,mapinfo$CHR)
At this point I can use the following line in order to subset my desired regions individually by chromosome:
> CHR1<- eval(parse(text=paste0('subset(subs$chr1,',Mcodes["chr1"],')')))
I would like to subset all chromosome specific data frames contained in "subs" by their chromosome specific corresponding list in "Mcodes" without having to run that last line of code 24 different times (as I have multiple bed files for various histones/histone variants that will eventually need to be put through the same pipeline). Is there a way to loop/apply/something to make that possible?
Sorry if it seems a trivial question - I am still very new to the R/programming game. Thanks for any advice.
Upvotes: 0
Views: 63
Reputation: 3174
Here's a toy example that, if I've understood you correctly, should help:
Here's a toy mapinfo
:
mapinfo <- data.frame(
CHR = paste0("chr", rep(1:3, each = 3)),
M = c(1, 10, 100)
)
mapinfo
#> CHR M
#> 1 chr1 1
#> 2 chr1 10
#> 3 chr1 100
#> 4 chr2 1
#> 5 chr2 10
#> 6 chr2 100
#> 7 chr3 1
#> 8 chr3 10
#> 9 chr3 100
A toy bed
:
bed <- data.frame(
CHR = paste0("chr", rep(1:3, each = 2)),
Start = c(0, 5, 15),
Stop = c(5, 15, 120)
)
bed
#> CHR Start Stop
#> 1 chr1 0 5
#> 2 chr1 5 15
#> 3 chr2 15 120
#> 4 chr2 0 5
#> 5 chr3 5 15
#> 6 chr3 15 120
Then, if you don't have them, install the packages dplyr
and purrr
(using install.packages(c("dplyr", "purrr"))
, and run the following:
library(dplyr)
library(purrr)
mapinfo %>%
left_join(bed) %>%
filter(pmap_lgl(list(M, Start, Stop), between))
#> CHR M Start Stop
#> 1 chr1 1 0 5
#> 2 chr1 10 5 15
#> 3 chr2 1 0 5
#> 4 chr2 100 15 120
#> 5 chr3 10 5 15
#> 6 chr3 100 15 120
This is a data frame of all elements in mapinfo
that were between Start
and Stop
values specified in bed
. At this point, you can go ahead and split it up as desired, remove Start
and Stop
columns, etc.
An important note: CHR
must have exactly the same label in both data frames (e.g., all caps). I noticed that you had in all caps in one and just first-letter caps in another. Be consistent here (a) for this code to work and (b) because consistency makes your code easier to digest.
To help you develop your understanding of the code, here are some relevant links to "R for Data Science" by Garrett Grolemund and Hadley Wickham:
%>%
left_join
filter
pmap_lgl
Upvotes: 1