StrikeR

Reputation: 1628

splitting a character vector at specified intervals in R

I have some strings in a specific format, and I need to split them at fixed character positions.
The strings look like this:

"abxyzpqrst34245"
"mndeflmnop6346781"

I want to split each of these strings after character positions c(2,5,10), so that the output will be:

[1] c("ab", "xyz", "pqrst", "34245")
[2] c("mn", "def", "lmnop", "6346781")

NOTE: The numeric part after the 3rd split is of variable length, whereas the preceding parts are of fixed length.

I tried to use cut, but it only works with integer vectors.
I looked at split, but I'm not sure if it works without factors.
So, I finally went with substr to divide each of the sentences separately like this:

substr("abxyzpqrst34245", 1,2)
[1] "ab"
substr("abxyzpqrst34245", 3,5)
[1] "xyz"
substr("abxyzpqrst34245", 6,10)
[1] "pqrst"
substr("abxyzpqrst34245", 11,10000)
[1] "34245"

This is a long-winded way to split these strings. Is there an easier way to do it?

Upvotes: 1

Views: 497

Answers (2)

joran

Reputation: 173527

You're looking for (the often overlooked) substring:

x <- "abxyzpqrst34245"
substring(x,c(1,3,6,11),c(2,5,10,nchar(x)))
[1] "ab"    "xyz"   "pqrst" "34245"

This is handy because substring is vectorized over its first and last arguments. If you want to do this over multiple strings in turn, you might do something like this:

x <- c("abxyzpqrst34245","mndeflmnop6346781")
lapply(x,function(y) substring(y,first = c(1,3,6,11),last = c(2,5,10,nchar(y))))
[[1]]
[1] "ab"    "xyz"   "pqrst" "34245"

[[2]]
[1] "mn"      "def"     "lmnop"   "6346781"
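
The start and end positions above can also be derived from the cut points c(2,5,10) given in the question, rather than typed out by hand. A minimal sketch, assuming a hypothetical helper named split_at (not part of base R) and cut points that are sorted and positive:

```r
# split_at() is a hypothetical helper name used here for illustration only.
# x:    character vector of strings to split
# cuts: positions after which to split, assumed sorted and positive
split_at <- function(x, cuts) {
  first <- c(1, cuts + 1)  # each piece starts right after the previous cut
  lapply(x, function(y) {
    last <- c(cuts, nchar(y))  # the final piece runs to the end of the string
    substring(y, first, last)
  })
}

x <- c("abxyzpqrst34245", "mndeflmnop6346781")
parts <- split_at(x, c(2, 5, 10))
parts[[1]]
# [1] "ab"    "xyz"   "pqrst" "34245"
```

Strings shorter than the last cut point would simply yield empty strings for the missing pieces, since substring returns "" when first is past the end of the string.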

Upvotes: 5

Josh O'Brien

Reputation: 162321

If you have a vector of strings to be split, you might also find read.fwf() handy. Use it like so:

x <- c("abxyzpqrst34245", "mndeflmnop6346781")
df <- read.fwf(file = textConnection(x), 
               widths = c(2,3,5,10000), 
               colClasses = "character")
df
#   V1  V2    V3      V4
# 1 ab xyz pqrst   34245
# 2 mn def lmnop 6346781
str(df)
# 'data.frame':   2 obs. of  4 variables:
#  $ V1: chr  "ab" "mn"
#  $ V2: chr  "xyz" "def"
#  $ V3: chr  "pqrst" "lmnop"
#  $ V4: chr  "34245" "6346781"
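
If you start from the cut points c(2,5,10) in the question rather than from field widths, the widths can be computed with diff(). The following is a sketch along those lines; the variable names cuts, widths and con are only illustrative, and the final width of 10000 is the same "large enough" assumption as above. Because a textConnection is already open when it is passed in, read.fwf does not close it, so the sketch closes it explicitly.

```r
x <- c("abxyzpqrst34245", "mndeflmnop6346781")

cuts <- c(2, 5, 10)                    # split after these character positions
widths <- c(diff(c(0, cuts)), 10000)   # 2, 3, 5, then "everything that is left"

con <- textConnection(x)
df <- read.fwf(file = con,
               widths = widths,
               colClasses = "character")
close(con)                             # the connection was opened by us, so close it

df$V4
# [1] "34245"   "6346781"
```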

Upvotes: 3
