yach

Reputation: 43

Expand a nucleotide sequence

I would like to expand expressions written in this form:

a <- "[AGAT]5GAT[AGAT]7[AGAC]6AGAT"

I would like to convert the expression into the full sequence, like this:

b <- "AGATAGATAGATAGATAGATGATAGATAGATAGATAGATAGATAGATAGATAGACAGACAGACAGACAGACAGACAGAT"

As you can see, the number after the closing bracket is the number of times the bracketed pattern is repeated.

For the moment I use sub(".*[*(.*?) *].*", "\\1", seq) to select the characters between [ ] and replicate(i, "my_string") to repeat the sequence between [ ], but I cannot work out how to make this work with my data.

I hope this is clear enough.

Upvotes: 3

Views: 83

Answers (2)

akrun

Reputation: 887203

We use gsub to insert a 1 before each [ that is not preceded by a digit ('a1'), then extract the letters and the numbers separately ('v1', 'v2'), replicate each substring with strrep, and paste the pieces into a single string ('res').

library(stringr)
a1 <- gsub("(?<![0-9])\\[", "1[", a, perl = TRUE)
v1 <- str_extract_all(a1, '[A-Z]+')[[1]]
v2 <- str_extract_all(a1, "[0-9]+")[[1]]
res <- paste(strrep(v1, as.numeric(c(tail(v2, -1), v2[1]))), collapse='')
res

-output

#[1] "AGATAGATAGATAGATAGATGATAGATAGATAGATAGATAGATAGATAGATAGACAGACAGACAGACAGACAGACAGAT"

-checking against 'b'

identical(res, b)
#[1] TRUE
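To see why the counts are reordered with c(tail(v2, -1), v2[1]), it can help to print the intermediate values. The sketch below repeats the same steps in base R (regmatches and gregexpr stand in for str_extract_all, which is a substitution and not part of the answer's code). The 1 inserted at the very start has no motif in front of it, so it is moved to the end, where it serves as the count for the trailing bare AGAT. This appears to rely on the input starting with a bracket and ending with an unbracketed motif, as it does here.

```r
a  <- "[AGAT]5GAT[AGAT]7[AGAC]6AGAT"

# Insert a 1 before every "[" that is not preceded by a digit
a1 <- gsub("(?<![0-9])\\[", "1[", a, perl = TRUE)
a1
# [1] "1[AGAT]5GAT1[AGAT]7[AGAC]6AGAT"

# Base R equivalents of str_extract_all(...)[[1]]
v1 <- regmatches(a1, gregexpr("[A-Z]+", a1))[[1]]
v2 <- regmatches(a1, gregexpr("[0-9]+", a1))[[1]]
v1
# [1] "AGAT" "GAT"  "AGAT" "AGAC" "AGAT"
v2
# [1] "1" "5" "1" "7" "6"

# Move the leading 1 to the end so that each count lines up with its motif
counts <- as.numeric(c(tail(v2, -1), v2[1]))
counts
# [1] 5 1 7 6 1
```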

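A slightly more compact option is to change the first step so that a 1 is inserted after each unbracketed run of letters (before a [ or at the end of the string), which means the counts no longer need to be reordered: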

a1 <- gsub("(?<=[A-Z])(?=\\[)|(?<=[A-Z])$", "1", a, perl = TRUE)
v1 <- str_extract_all(a1, '[A-Z]+')[[1]]
v2 <- str_extract_all(a1, "[0-9]+")[[1]]
res1 <- paste(strrep(v1, as.numeric(v2)), collapse="")
identical(res1, b)
#[1] TRUE

data

a <- '[AGAT]5GAT[AGAT]7[AGAC]6AGAT'
b <- 'AGATAGATAGATAGATAGATGATAGATAGATAGATAGATAGATAGATAGATAGACAGACAGACAGACAGACAGACAGAT'

Upvotes: 3

Terru_theTerror

Reputation: 5017

Try this:

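a <- "[AGAT]5GAT[AGAT]7[AGAC]6AGAT"

# split on "]" and then on "[" to separate the pieces
parts <- unlist(strsplit(unlist(strsplit(a, "\\]")), "\\["))

# leading number of each piece; pieces without one get a count of 1
number <- suppressWarnings(as.numeric(gsub("([0-9]+).*$", "\\1", parts)))
number[is.na(number)] <- 1
# letters of each piece, with the digits removed
motif <- gsub("[0-9]+", "", parts)

# drop the empty first piece and shift the counts by one position, so that each
# bracketed motif lines up with the number at the start of the next piece
out <- paste(rep(motif[2:length(motif)], number[c(3:length(number), 2)]), collapse = "")

b <- "AGATAGATAGATAGATAGATGATAGATAGATAGATAGATAGATAGATAGATAGACAGACAGACAGACAGACAGACAGAT"

out == b
#[1] TRUE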

The output is correct, but I don't know whether this is a general solution for every kind of input.
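On the question of generality, one possible alternative is to tokenize the string directly instead of reordering counts. The sketch below is not taken from either answer: expand_repeats is a hypothetical helper name, and it assumes that every piece is either a bracketed run of uppercase letters with an optional count, or a bare run of uppercase letters that appears once.

```r
# Hypothetical helper: expand "[MOTIF]n" notation into the full sequence.
expand_repeats <- function(x) {
  # Each token is "[LETTERS]" followed by optional digits, or a bare run of letters
  tokens <- regmatches(x, gregexpr("\\[[A-Z]+\\][0-9]*|[A-Z]+", x, perl = TRUE))[[1]]
  motifs <- gsub("[^A-Z]", "", tokens)   # keep only the letters
  counts <- gsub("[^0-9]", "", tokens)   # keep only the digits
  counts[counts == ""] <- "1"            # assumption: no count means one copy
  paste(strrep(motifs, as.integer(counts)), collapse = "")
}

a <- "[AGAT]5GAT[AGAT]7[AGAC]6AGAT"

# Build the expected sequence piece by piece to check the result
expected <- paste0(strrep("AGAT", 5), "GAT", strrep("AGAT", 7),
                   strrep("AGAC", 6), "AGAT")
identical(expand_repeats(a), expected)
# [1] TRUE
```

Because each motif and its count come from the same token, the order of bracketed and bare pieces should not matter under these assumptions, for example when the string starts with a bare motif or ends with a counted one.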

Upvotes: 2
