Reputation: 1628
I have some strings in a specific format, and I need to split them at fixed character positions.
The strings look like this:
"abxyzpqrst34245"
"mndeflmnop6346781"
I want to split each of these strings after the following character positions: c(2,5,10), so that the output will be:
[1] c("ab", "xyz", "pqrst", "34245")
[2] c("mn", "def", "lmnop", "6346781")
NOTE: The numeric part after the third split is of variable length, whereas the preceding parts are of fixed length.
I tried to use cut, but it only works with numeric vectors.
I looked at split, but I'm not sure if it works without factors.
So, I finally went with substr to divide each of the strings separately like this:
substr("abxyzpqrst34245", 1,2)
[1] "ab"
substr("abxyzpqrst34245", 3,5)
[1] "xyz"
substr("abxyzpqrst34245", 6,10)
[1] "pqrst"
substr("abxyzpqrst34245", 11,10000)
[1] "34245"
This is a long-winded way to split these strings. Is there an easier way to do it?
Upvotes: 1
Views: 497
Reputation: 173527
You're looking for (the often overlooked) substring:
x <- "abxyzpqrst34245"
substring(x,c(1,3,6,11),c(2,5,10,nchar(x)))
[1] "ab" "xyz" "pqrst" "34245"
which is handy because it is fully vectorized. If you want to do this over multiple strings in turn, you might do something like this:
x <- c("abxyzpqrst34245","mndeflmnop6346781")
lapply(x,function(y) substring(y,first = c(1,3,6,11),last = c(2,5,10,nchar(y))))
[[1]]
[1] "ab" "xyz" "pqrst" "34245"
[[2]]
[1] "mn" "def" "lmnop" "6346781"
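If you would rather pass the split positions c(2,5,10) from the question directly, the start and end indices can be derived from them. The following is only a rough sketch of that idea; the helper name split_at and its argument names are made up for illustration and are not part of base R:

```r
# Hypothetical helper (not a base R function): split each string
# after the given character positions.
split_at <- function(strings, breaks) {
  # Each piece starts one character past the previous break.
  starts <- c(1, breaks + 1)
  lapply(strings, function(s) {
    # The last piece runs to the end of the string, whatever its length.
    ends <- c(breaks, nchar(s))
    substring(s, first = starts, last = ends)
  })
}

x <- c("abxyzpqrst34245", "mndeflmnop6346781")
res <- split_at(x, c(2, 5, 10))
```

Here res is a list with one character vector per input string, the same shape as the lapply() result above.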
Upvotes: 5
Reputation: 162321
If you have a vector of strings to be split, you might also find read.fwf() handy. Use it like so:
x <- c("abxyzpqrst34245", "mndeflmnop6346781")
df <- read.fwf(file = textConnection(x),
               widths = c(2,3,5,10000),
               colClasses = "character")
df
# V1 V2 V3 V4
# 1 ab xyz pqrst 34245
# 2 mn def lmnop 6346781
str(df)
# 'data.frame': 2 obs. of 4 variables:
# $ V1: chr "ab" "mn"
# $ V2: chr "xyz" "def"
# $ V3: chr "pqrst" "lmnop"
# $ V4: chr "34245" "6346781"
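If you need the list-of-vectors shape from the question rather than a data frame, one possible follow-up is to pull each row back out as a character vector. This is just a sketch under a couple of assumptions: the last width is computed from the longest string instead of the placeholder 10000, and the names con, last_width and pieces are invented for the example:

```r
x <- c("abxyzpqrst34245", "mndeflmnop6346781")

# Assumption: size the last field from the longest input instead of
# using an arbitrary large width (the first three fields cover 10 chars).
last_width <- max(nchar(x)) - 10

con <- textConnection(x)
df <- read.fwf(file = con,
               widths = c(2, 3, 5, last_width),
               colClasses = "character")
close(con)  # the text connection is not closed automatically

# Turn each row of the data frame into an unnamed character vector.
pieces <- lapply(seq_len(nrow(df)), function(i) unlist(df[i, ], use.names = FALSE))
```

Keeping everything as "character" via colClasses matters here, since otherwise the last column would be read as numbers.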
Upvotes: 3