splitting a character vector at specified intervals in R

Question

I have some sentences in a specific format and I need to split them at regular intervals.
The sentences look like this:

"abxyzpqrst34245"
"mndeflmnop6346781"

I want to split each of these sentences after the following character positions: c(2,5,10), so that the output will be:

[1] c("ab", "xyz", "pqrst", "34245")
[2] c("mn", "def", "lmnop", "6346781")

NOTE: The numeric part after the 3rd split is of variable length, whereas the previous parts are of fixed length.

I tried to use cut, but it only works with numeric vectors.
I looked at split, but I'm not sure if it works without factors.
So, I finally went with substr to divide each of the sentences separately like this:

substr("abxyzpqrst34245", 1,2)
[1] "ab"
substr("abxyzpqrst34245", 3,5)
[1] "xyz"
substr("abxyzpqrst34245", 6,10)
[1] "pqrst"
substr("abxyzpqrst34245", 11,10000)
[1] "34245"

I'm using this long process to split these strings. Is there any easier way to achieve this splitting?

joran · Accepted Answer

You're looking for (the often overlooked) substring:

x <- "abxyzpqrst34245"
substring(x,c(1,3,6,11),c(2,5,10,nchar(x)))
[1] "ab"    "xyz"   "pqrst" "34245"

which is handy because it is fully vectorized. If you want to do this over multiple strings in turn, you might do something like this:

x <- c("abxyzpqrst34245","mndeflmnop6346781")
lapply(x,function(y) substring(y,first = c(1,3,6,11),last = c(2,5,10,nchar(y))))
[[1]]
[1] "ab"    "xyz"   "pqrst" "34245"

[[2]]
[1] "mn"      "def"     "lmnop"   "6346781"
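
If you need this for several sets of cut points, the start and end positions can be derived from the cut points rather than typed by hand. Here is a minimal sketch of that idea; the name split_at and its cuts argument are hypothetical, chosen for illustration, and it assumes the cut points are increasing positive integers:

```r
# Hypothetical helper (not part of base R): split each string in x
# after the character positions given in cuts.
split_at <- function(x, cuts) {
  first <- c(1, cuts + 1)        # each piece starts right after the previous cut
  lapply(x, function(y) {
    last <- c(cuts, nchar(y))    # the final piece runs to the end of the string
    substring(y, first, last)
  })
}

x <- c("abxyzpqrst34245", "mndeflmnop6346781")
split_at(x, c(2, 5, 10))
```

For the two example strings this should give the same list as the lapply call above. Note that substring also has a default for last (1000000L), so for strings shorter than that the explicit nchar(y) could be dropped; it is kept here to make the end position visible.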
