Reputation: 251
I have a DNA sequence that I'm trying to determine the quality of the assembly by visualizing the presence and frequency of n (generic nucleotides). I can do this by putting it into a specific format and loading it into some other software.
The output would be tsv/csv with the following column names: startposition endposition value(end-start)
My first thought was to create a vector of characters from the string and then loop through recording the start and stop positions of a block of n's. It works, but I'm almost certain there is a better and easier way. I'm not very good with regex, but willing to use it with some help.
stringa<-"attaaatagccgcggaagacctttttcatatgatagaatccccaacacannnnnnacgtagacaeagagaagattttcccccggagagcgcgaatagannnnnnnnnccataatatttataaattaatttat"
stringvector<-strsplit(stringa, NULL)[[1]]
newdf<-data.frame(start=rep(NA, 10), end=rep(NA, 10), length=rep(NA, 10))
counter=1
for (i in 1:length(stringvector)){
if (stringvector[i]=="n"){
if (is.na(newdf[counter,][1])){
#record start position
newdf[counter,][1]<-i
} else {
#do nothing you found the start already
}
}
if (stringvector[i]!="n" & !is.na(newdf[counter,][1]) & is.na(newdf[counter,][2])){
newdf[counter,][2]<-i-1
newdf[counter,][3]<-newdf[counter,][2]-newdf[counter,][1]
counter=counter+1
}
}
The sequence lengths should not be bigger than about 50 million characters and I don't need to do this for many sequences, but was hoping for something more elegant.
Any ideas?
Upvotes: 1
Views: 1069
Reputation: 67828
Another possibility using the stringi
package:
library(stringi)
m <- stri_locate_all(str = "annaannn", regex = "n+")[[1]]
length <- m[ , "end"] - m[ , "start"] + 1
cbind(m, length)
# start end length
# [1,] 2 3 2
# [2,] 6 8 3
Edit: Added benchmark. stri_locate_all
seems fastest. Please note that I added an "end" variable to @ilir's answer to make it more comparable to the other two alternatives.
fun_stri <- function(x){
m <- stri_locate_all(str = x, regex = "n+")[[1]]
length <- m[ , "end"] - m[ , "start"] + 1
cbind(m, length)
}
fun_greg <- function(x){
result <- gregexpr(pattern="n+", c(x))[[1]]
data.frame(start = result,
end = result + attr(result, "match.length") - 1,
length = attr(result, "match.length"))
}
fun_rle <- function(stringa, char = "n") {
tmp <- rle(strsplit(stringa, NULL)[[1]])
start <- sapply(which(tmp$values == char)-1, function(x) sum(tmp$length[1:x]))+1
length <- tmp$length[tmp$values == char]
return(data.frame(start, end = start + length, length))
}
# just check results on short strings
fun_stri("annaannn")
fun_greg("annaannn")
fun_rle("annaannn")
fun_stri(stringa)
fun_greg(stringa)
fun_rle(stringa)
library(microbenchmark)
# size = 1e4
x <- paste(sample(c("a", "t", "c", "g", "n"), size = 1e4, replace = TRUE), collapse = "")
microbenchmark(
fun_stri(x),
fun_greg(x),
fun_rle(x),
times = 10)
# Unit: microseconds
# expr min lq median uq max neval
# fun_stri(x) 535.221 574.753 632.9140 684.611 711.980 10
# fun_greg(x) 709.699 748.473 776.9815 790.286 913.068 10
# fun_rle(x) 47588.602 48281.955 50071.7875 67811.410 87781.053 10
# size = 1e5
x <- paste(sample(c("a", "t", "c", "g", "n"), 1e5, replace = TRUE), collapse = "")
microbenchmark(
fun_stri(x),
fun_greg(x),
fun_rle(x),
times = 10)
# Unit: milliseconds
# expr min lq median uq max neval
# fun_stri(x) 2.871487 2.963478 3.011184 3.202578 3.272142 10
# fun_greg(x) 4.842070 4.914295 5.013888 5.368927 5.490949 10
# fun_rle(x) 3853.887170 3868.795788 3889.699217 4228.450830 4411.025536 10
# size = 5e7 ("about 50 million characters" in OP)
x <- paste(sample(c("a", "t", "c", "g", "n"), size = 5e7, replace = TRUE), collapse = "")
microbenchmark(
fun_stri(x),
fun_greg(x),
times = 10)
# Unit: seconds
# expr min lq median uq max neval
# fun_stri(x) 1.479089 1.54405 1.606676 1.757292 1.974795 10
# fun_greg(x) 2.381448 2.39754 2.422554 2.476974 2.643259 10
Upvotes: 4
Reputation: 54287
Something like this?
rle_ <- function(stringa, char = "n") {
tmp <- rle(strsplit(stringa, NULL)[[1]])
start <- sapply(which(tmp$values == char)-1, function(x) sum(tmp$length[1:x]))+1
length <- tmp$length[tmp$values == char]
return(data.frame(start, end = start + length, length))
}
rle_(stringa)
# start end length
# 1 50 56 6
# 2 99 108 9
Upvotes: 2
Reputation: 3224
Not sure if more elegant, but should certainly be faster if you use gregexpr
.
result <- gregexpr(pattern="n+", c(stringa))[[1]]
new.df <- data.frame(start=result, length=attr(result, "match.length"))
Note that this solution also needs just minor modification if stringa
is a vector. In that case gregexpr
will return a list that needs to be unpacked. It can be done without a loop, with a combination of lapply
and do.call(rbind,...)
Upvotes: 4