JohnnyKing

Reputation: 29

Split string based on capitalized word using R

I have a string that I would like to split into several strings.

library(stringr)
testString <- "SMITH, Klaus, text, text, SMITH, Samantha, text, text,  MUELLER, Klaus, text, text,  MUELLER, Klara, text, text"

Whenever a word is completely capitalised (and followed by a comma), it should start a new string. In the end the result should look like this:

[1] "SMITH, Klaus, text, text,"
[2] "SMITH, Samantha, text, text,"
[3] "MUELLER, Klaus, text, text,"
[4] "MUELLER, Klara, text, text"

I have tried several approaches with strsplit, but I can't get R to match a complete word (which can have a varying number of letters) rather than a single letter, and then split the string there.

strsplit(testString, "(?!^)(?<=[[:upper:]]{2})", perl=T)

Upvotes: 0

Views: 115

Answers (1)

akrun

Reputation: 886948

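Use a regex lookahead: split on one or more whitespace characters (\\s+) that are immediately followed by one or more uppercase letters and a comma ((?=[A-Z]+,)). Because the lookahead is zero-width, only the whitespace is consumed and the capitalised word stays in the result.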

strsplit(testString, "\\s+(?=[A-Z]+,)", perl = TRUE)[[1]]

Output:

[1] "SMITH, Klaus, text, text," 
[2] "SMITH, Samantha, text, text," 
[3] "MUELLER, Klaus, text, text," 
[4] "MUELLER, Klara, text, text"  
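If it helps, here is a slightly stricter variant as a minimal sketch. It assumes that a record always starts with a surname of at least two capital letters, so a single-letter field such as "A," would not trigger a split. That `{2,}` requirement and the `trimws()` cleanup are my assumptions, not something stated in the question.

```r
testString <- "SMITH, Klaus, text, text, SMITH, Samantha, text, text,  MUELLER, Klaus, text, text,  MUELLER, Klara, text, text"

# Split on whitespace that is immediately followed by an all-uppercase word
# of at least two letters and a comma. The {2,} lower bound is an assumption
# about the data, not a requirement from the question.
parts <- strsplit(testString, "\\s+(?=[[:upper:]]{2,},)", perl = TRUE)[[1]]

# Remove any stray leading or trailing whitespace from each piece.
parts <- trimws(parts)

print(parts)
```

For the test string above this should give the same four pieces as the `[A-Z]+` version. The two patterns only differ on input that contains single-letter capitalised fields.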

Upvotes: 3
