Reputation: 35
I have derived all the start and stop positions within a DNA string, each stored as a vector. I would now like to map each start position to a stop position and then use these pairs to extract the corresponding substrings from the DNA sequence. However, I am unable to loop through both vectors efficiently to achieve this, especially as they are not of the same length.
I have tried different versions of loops (for, ifelse) but I am not quite able to wrap my head around a solution yet.
Here is an example of one of my several attempts at solving this problem.
new = data.frame()
for (i in start_pos){
for (j in stop_pos){
while (j>i){
new[j,1]=i
new[j,2]=j
}
}
}
Here is an example of my desired result, given start = c(1, 5, 7, 9, 15) and stop = c(4, 13, 20, 30, 40, 50). Ideally the result would be a data frame of two columns mapping each start to its stop position. I only want to add rows to df where the stop value is greater than its corresponding start value (multiple start values can share the same stop value as long as this criterion is fulfilled), as shown in my example below.
i.e. first row of df = (1, 4)
second row of df = (5, 13)
third row of df = (7, 13)
fourth row of df = (9, 13)
fifth row of df = (15, 20)
Upvotes: 2
Views: 156
Reputation: 13319
Here is a possible tidyverse solution:
library(purrr)
library(plyr)
library(dplyr)
map2 is used to map over the values of the two vectors (start and stop). We then make one vector out of these, followed by unlisting and combining our results into a data.frame object.
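To make the pairing step concrete, here is a minimal sketch using base R's Map, which walks two vectors in parallel in the same way map2 does when both have equal length. The three-element vectors are an assumption chosen for illustration, not the asker's full data.

```r
# Assumed equal-length example vectors
start <- c(1, 5, 15)
stop  <- c(4, 13, 20)

# Map() calls the function on (start[1], stop[1]), (start[2], stop[2]), ...
pairs <- Map(function(x, y) c(start = x, stop = y), start, stop)

# Each list element is one named pair; bind them row-wise into a data frame
df <- as.data.frame(do.call(rbind, pairs))
print(df)
```

Unlike map2, Map recycles shorter inputs silently, so trimming both vectors to a common length first is the safer habit.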
EDIT: With the updated condition, we can do something like:
start1= c(118,220, 255)
stop1 =c(115,210,260)
res<-purrr::map2(start1[1:length(stop1)],stop1,function(x,y) c(x,y[y>x]))
res[unlist(lapply(res,function(x) length(x)>1))]
# [[1]]
# [1] 255 260
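If a data frame is preferred over a list, the same condition can probably be expressed as a plain logical filter. This is only a sketch in base R, and it assumes start1 and stop1 have the same length so that they can be compared element by element; res_df is a name made up for this example.

```r
start1 <- c(118, 220, 255)
stop1  <- c(115, 210, 260)

# Logical mask: TRUE where the stop lies beyond its start
keep <- stop1 > start1

# Keep only the pairs that satisfy the condition
res_df <- data.frame(start = start1[keep], stop = stop1[keep])
print(res_df)
```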
ORIGINAL:
plyr::ldply(purrr::map2(start[1:length(stop)],stop,function(x,y) c(x,y)),unlist) %>%
setNames(nm=c("start","stop")) %>%
mutate(newCol=paste0("(",start,",",stop,")"))
# start stop newCol
#1 1 4 (1,4)
#2 5 13 (5,13)
#3 15 20 (15,20)
#4 NA 30 (NA,30)
#5 NA 40 (NA,40)
#6 NA 50 (NA,50)
Alternative: A clever way is shown by @Marius. The key is to have corresponding lengths.
plyr::ldply(purrr::map2(start,stop[1:length(start)],function(x,y) c(x,y)),unlist) %>%
setNames(nm=c("start","stop")) %>%
mutate(newCol=paste0("(",start,",",stop,")"))
# start stop newCol
#1 1 4 (1,4)
#2 5 13 (5,13)
#3 15 20 (15,20)
Upvotes: 1
Reputation: 60080
Here's a fairly simple solution - it's probably good not to over-complicate things unless you're sure you need the extra complexity. The starts and stops already seem to be matched up, you just might have more of one than the other, so you can find the length of the shortest vector and only use that many items from start and stop:
start = c(1, 5, 15)
stop = c(4, 13, 20, 30, 40, 50)
min_length = min(length(start), length(stop))
df = data.frame(
  start = start[1:min_length],
  stop = stop[1:min_length]
)
EDIT: after reading some of your comments here, it looks like your problem actually is more complicated than it first seemed (coming up with examples that demonstrate the level of complexity you need, without being overly complex, is always tricky). If you want to match each start with the next stop that's greater than the start, you can do:
# Slightly modified example: multiple starts
# that can be matched with one stop
start = c(1, 5, 8)
stop = c(4, 13, 20, 30, 40, 50)
df2 = data.frame(
  start = start,
  stop = sapply(start, function(s) { min(stop[stop > s]) })
)
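As a rough sketch of how this could be carried through to the substring extraction mentioned in the question, the same next-greater-stop matching is applied below to the question's example vectors. The dna string is a made-up 50-character sequence, and the names next_stop and pairs are introduced only for this illustration. Starts with no later stop are marked NA and dropped, which avoids min() returning Inf with a warning on an empty selection.

```r
# Example vectors from the question
start <- c(1, 5, 7, 9, 15)
stop  <- c(4, 13, 20, 30, 40, 50)

# For each start, take the smallest stop that is strictly greater;
# NA marks a start that has no later stop
next_stop <- vapply(start, function(s) {
  later <- stop[stop > s]
  if (length(later) == 0) NA_real_ else min(later)
}, numeric(1))

pairs <- data.frame(start = start, stop = next_stop)
pairs <- pairs[!is.na(pairs$stop), ]

# Hypothetical sequence, invented purely for illustration
dna <- paste(rep("ATGCGTTAGC", 5), collapse = "")

# substring() is vectorised over its first and last arguments
pairs$seq <- substring(dna, pairs$start, pairs$stop)
print(pairs)
```

With the question's vectors, the start and stop columns should come out as the five pairs listed in the desired result.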
Upvotes: 1