clerksx

Reputation: 640

R: split a string with a regex but keep the matched characters

How can I split the following string?

"Wes Anderson – The Grand Budapest HotelRichard Linklater – BoyhoodBennett Miller – FoxcatcherMorten Tyldum – The Imitation Game"

into:

"Wes Anderson – The Grand Budapest Hotel"
"Richard Linklater – Boyhood"
"Bennett Miller – Foxcatcher"
"Morten Tyldum – The Imitation Game"

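The first split point is "HotelRichard", so I think a pattern like [a-z][A-Z] (a lowercase letter directly followed by an uppercase letter) could locate the split points. But if I split on that pattern, the matched characters are removed:

strsplit("HotelRichard", "[a-z][A-Z]") returns "Hote" "ichard".

How can I split at those points while keeping the characters?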

Upvotes: 1

Views: 50

Answers (3)

Tyler Rinker

Reputation: 109874

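Here's an approach using a single regex with a lookbehind and a lookahead. Both are zero-width, so the split happens between the two letters and nothing is removed:

x <- "Wes Anderson – The Grand Budapest HotelRichard Linklater – BoyhoodBennett Miller – FoxcatcherMorten Tyldum – The Imitation Game"
strsplit(x, "(?<=[a-z])(?=[A-Z])", perl = TRUE)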

## [[1]]
## [1] "Wes Anderson – The Grand Budapest Hotel"
## [2] "Richard Linklater – Boyhood"            
## [3] "Bennett Miller – Foxcatcher"            
## [4] "Morten Tyldum – The Imitation Game"     
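To see why this keeps the letters, here is a minimal sketch on the short test string from the question. It compares the consuming pattern with the zero-width one; the variable names `consumed` and `kept` are only for illustration.

```r
s <- "HotelRichard"

# Consuming pattern: the matched "l" and "R" are part of the
# separator, so strsplit() drops them.
consumed <- strsplit(s, "[a-z][A-Z]")[[1]]

# Zero-width lookbehind + lookahead: the match sits between
# "l" and "R" and has length zero, so nothing is dropped.
kept <- strsplit(s, "(?<=[a-z])(?=[A-Z])", perl = TRUE)[[1]]

print(consumed)
print(kept)
```

Note that perl = TRUE is needed here, since the default regex engine used by strsplit() does not support lookarounds.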

Upvotes: 0

lawyeR

Reputation: 7654

First break apart the director/film mashups, then split the string at the inserted "xxx". The first step captures the lowercase and uppercase letters as two groups and writes them back with three x's in between. This uses the stringr package.

library(stringr)

text <- "Wes Anderson – The Grand Budapest HotelRichard Linklater – BoyhoodBennett Miller – FoxcatcherMorten Tyldum – The Imitation Game"
text.split <- str_replace_all(text, "([a-z])([A-Z])", "\\1xxx\\2")
text.final <- str_split(text.split, "xxx")
text.final
[[1]]
[1] "Wes Anderson – The Grand Budapest Hotel" "Richard Linklater – Boyhood"            
[3] "Bennett Miller – Foxcatcher"             "Morten Tyldum – The Imitation Game"
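The two-step idea may be easier to follow on the short string from the question. This is only a sketch that uses base R's gsub() and strsplit() in place of the stringr calls, so it runs without extra packages; the variable names are illustrative.

```r
s <- "HotelRichard"

# Step 1: \\1 and \\2 write the captured lowercase and uppercase
# letters back, with the marker "xxx" inserted between them.
marked <- gsub("([a-z])([A-Z])", "\\1xxx\\2", s)
print(marked)   # "HotelxxxRichard"

# Step 2: split on the literal marker; fixed = TRUE treats "xxx"
# as plain text rather than as a regex.
pieces <- strsplit(marked, "xxx", fixed = TRUE)[[1]]
print(pieces)   # "Hotel" "Richard"
```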

Upvotes: 0

Wiktor Stribiżew

Reputation: 626893

You can try this workaround: insert a § sign at each lowercase/uppercase boundary (this assumes § does not already occur in your input) and then split on it:

x <- "Wes Anderson – The Grand Budapest HotelRichard Linklater – BoyhoodBennett Miller – FoxcatcherMorten Tyldum – The Imitation Game"
x <- gsub("([a-z])([A-Z])", "\\1§\\2", x)
strsplit(x, "§")

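Sample program output (\342\200\223 is the UTF-8 byte sequence of the en dash, shown as octal escapes, which is how R prints such bytes in a non-UTF-8 locale):

[[1]]
[1] "Wes Anderson \342\200\223 The Grand Budapest Hotel"
[2] "Richard Linklater \342\200\223 Boyhood"
[3] "Bennett Miller \342\200\223 Foxcatcher"
[4] "Morten Tyldum \342\200\223 The Imitation Game"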
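Since the workaround depends on the marker not appearing in the input, one option is to check for it first. The following is a rough sketch only: split_with_marker is a hypothetical helper name, not an existing function, and it assumes the marker is a simple string without backslashes.

```r
# Hypothetical helper: refuses to run if the marker is already
# present, so real text is never split by accident.
split_with_marker <- function(s, marker = "\u00a7") {
  if (grepl(marker, s, fixed = TRUE)) {
    stop("marker already occurs in the input; choose another one")
  }
  # Insert the marker between a lowercase and an uppercase letter.
  marked <- gsub("([a-z])([A-Z])", paste0("\\1", marker, "\\2"), s)
  # Split on the literal marker.
  strsplit(marked, marker, fixed = TRUE)[[1]]
}

result <- split_with_marker("HotelRichard")
print(result)
```

If the check fails, passing a different marker (for example marker = "xxx") is enough; the regex itself stays the same.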

Upvotes: 3
