Khawaja Owaise Hussain

Reputation: 109

Substrings of whole string in R

This type of question has been asked many times; however, I could not find an answer that fits my needs.

I know some ways of splitting strings in R. If I have a string x <- "AGCAGT" and want to split it into substrings of three characters, I would do this with

substring(x, seq(1, nchar(x)-1, 3), seq(3, nchar(x), 3))

and into substrings of two characters, much faster, with

split <- strsplit(x, "")[[1]]
substrg <- paste0(split[c(TRUE, FALSE)], split[c(FALSE, TRUE)])

As a new R user, I find it difficult to split a string according to my requirements. If x <- "AGCTG" and I use substring(x, seq(1, nchar(x)-1, 3), seq(3, nchar(x), 3)), I do not get the final two-character substring. I get

"AGC" ""

However, I would like to get something like

"AGC" "TG"

or, if I have x <- "AGCT" and split it three characters at a time, I would like to get something like

"AGC" "T"

In short, how can I split a string into substrings of a desired length (2, 3, 4, 5, ..., n) while also retaining any remaining characters that are fewer than the desired length?

Upvotes: 0

Views: 262

Answers (2)

Khawaja Owaise Hussain

Reputation: 109

This answer is from zx8754, who unfortunately deleted it after someone marked the question as a duplicate. If he would like to post it as his own answer, I'll delete my post.

x <- "AGCGGCCAGCTGCCTGAA"
mylen <- 5 
ss <- strsplit(x, "")[[1]]
sapply(split(ss, ceiling(seq_along(ss)/mylen)), paste, collapse = "")
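To make the grouping step easier to follow, here is a minimal sketch of the same idea with the intermediate index spelled out. The names grp and chunks are only illustrative, and vapply is swapped in for sapply so the result type is fixed; this is one way to write it, not necessarily how zx8754 would.

```r
x <- "AGCGGCCAGCTGCCTGAA"
mylen <- 5

# one element per character
ss <- strsplit(x, "")[[1]]

# group index: characters 1-5 get 1, 6-10 get 2, and so on;
# the last group simply ends up shorter when nchar(x) is not a multiple of mylen
grp <- ceiling(seq_along(ss) / mylen)

# split the characters by group and paste each group back together
chunks <- vapply(split(ss, grp), paste, character(1), collapse = "")

unname(chunks)
# [1] "AGCGG" "CCAGC" "TGCCT" "GAA"
```

The result of split() carries the group numbers as names, which is why unname() is used for display.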

Upvotes: 1

cdeterman

Reputation: 19970

Here is one possible solution in a few simple steps.

x <- "AGCGGCCAGCTGCCTGAA"

# desired length
mylen <- 5

# start indices
start <- seq(1, nchar(x), mylen)

# end indices (capped at the string length so the last chunk may be shorter)
end <- pmin(start + mylen - 1, nchar(x))

substring(x, start, end)
# [1] "AGCGG" "CCAGC" "TGCCT" "GAA"
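If this is needed in more than one place, the steps above could be wrapped in a small helper. This is only a sketch: split_fixed is a hypothetical name, not a base R function, and it assumes a single string as input. The empty-string guard is there because seq(1, 0, by = n) would raise an error.

```r
# hypothetical helper: split one string into chunks of n characters,
# keeping a shorter final chunk when nchar(x) is not a multiple of n
split_fixed <- function(x, n) {
  stopifnot(is.character(x), length(x) == 1, n >= 1)
  len <- nchar(x)
  if (len == 0) return(character(0))
  start <- seq(1, len, by = n)
  end <- pmin(start + n - 1, len)
  substring(x, start, end)
}

split_fixed("AGCTG", 3)
# [1] "AGC" "TG"
split_fixed("AGCT", 3)
# [1] "AGC" "T"
```

These two calls use the examples from the question and return the trailing shorter substring instead of an empty string.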

Upvotes: 1
