allanvc

Reputation: 1156

Split a string AFTER a pattern occurs

I have the following vector of strings. It contains two elements, and each element consists of two phrases that have been run together.

strings <- c("This is a phrase with a NameThis is another phrase",
         "This is a phrase with the number 2019This is another phrase")

I would like to split each element of the vector into its two phrases. I've been trying something like:

library(stringr)

str_split(strings, "\\B(?=[a-z|0-9][A-Z])")

which almost gives me what I'm looking for:

[[1]]
[1] "This is a phrase with a Nam" "eThis is another phrase"

[[2]]
[1] "This is a phrase with the number 201" "9This is another phrase"

I would like to make the split AFTER the pattern but cannot figure out how to do that.

I guess I'm close to a solution and would appreciate any help.

Upvotes: 6

Views: 1639

Answers (2)

milan

Reputation: 4970

An alternative in base R: use a lookbehind for a lowercase letter or digit and a lookahead for an uppercase letter, and split at the position between them.

strsplit(strings, "(?<=[[:lower:][:digit:]])(?=[[:upper:]])", perl=TRUE)

[[1]]
[1] "This is a phrase with a Name" "This is another phrase"      

[[2]]
[1] "This is a phrase with the number 2019" "This is another phrase"

Upvotes: 2

CertainPerformance

Reputation: 370689

You need to match the position right before the capital letter, not the position before the last character of the first phrase (which is one character too early). One option is to match a non-word boundary with a lookahead for a capital letter:

str_split(strings, "\\B(?=[A-Z])")
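
To see the one-character offset described above, here is a small base R sketch (an illustration, not part of the original answer). It uses regexpr() with perl = TRUE to find where each zero-width match starts; the question's character class is simplified to [a-z0-9] here, and the variable names old_pos and new_pos are made up for this example.

```r
# Base R sketch: regexpr() with perl = TRUE reports where a zero-width match starts.
s <- "This is a phrase with a NameThis is another phrase"

# Pattern from the question (character class simplified to [a-z0-9])
old_pos <- as.integer(regexpr("\\B(?=[a-z0-9][A-Z])", s, perl = TRUE))

# Pattern suggested in this answer
new_pos <- as.integer(regexpr("\\B(?=[A-Z])", s, perl = TRUE))

substr(s, old_pos, old_pos)  # character immediately after the old split point
substr(s, new_pos, new_pos)  # character immediately after the new split point
new_pos - old_pos            # distance between the two split points
```

The old split point should land just before the final "e" of "Name", which matches the "Nam" / "eThis" output in the question, while the new one should land just before the "T" that starts the second phrase.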

If a phrase can begin with a run of capital letters (in which case the pattern above would also split inside that run), but contains no capital letters once the lowercase letters start, you can instead use a lookbehind for a digit or a lowercase letter. No non-word boundary is needed this time:

strings <- c("SHOCKING NEWS: someone did somethingThis is another phrase",
         "This is a phrase with the number 2019This is another phrase")
str_split(strings, "(?<=[a-z0-9])(?=[A-Z])")
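
If stringr is not available, the same lookbehind/lookahead pattern can be sketched in base R with strsplit() and perl = TRUE. This is only a minimal sketch: split_phrases is a hypothetical helper name, and it assumes ASCII letters, as the [a-z0-9] and [A-Z] classes do.

```r
# Hypothetical helper: split between a lowercase letter or digit
# and the uppercase letter that follows it (zero-width match).
split_phrases <- function(x) {
  strsplit(x, "(?<=[a-z0-9])(?=[A-Z])", perl = TRUE)
}

strings <- c("SHOCKING NEWS: someone did somethingThis is another phrase",
             "This is a phrase with the number 2019This is another phrase")

result <- split_phrases(strings)
result
```

Each element of result should contain two phrases, with the leading "SHOCKING NEWS:" left intact because none of its capitals is preceded by a lowercase letter or digit.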

Upvotes: 4
