Reputation: 3921
I'd like to split a text string in R, but with some special handling. For instance, if the string contains a dot "." or an exclamation mark "!", I want my function to return them as separate elements of the split list. Below is an example of what I want to get.
mytext="Caracas. Montevideo! Chicago."
split= "Caracas", "." ,"Montevideo", "!", "Chicago", "."
My current approach first uses the built-in R function gsub to replace "." with " ." and "!" with " !",
and then applies strsplit to split the result on spaces.
mytext=gsub("\\."," .",mytext)
mytext=gsub("\\!"," !",mytext)
unlist(strsplit(mytext,split=' '))
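For reference, the two gsub calls above can be folded into a single call with a capture group. This is only a sketch of the same approach, not a different technique. The function name split_keep_punct is hypothetical, and it assumes that the only punctuation of interest is "." and "!" and that these are never already preceded by a space (otherwise empty strings would appear in the result).

```r
# Hypothetical helper; the punctuation set [.!] is taken from the question.
split_keep_punct <- function(x) {
  # Insert one space before every "." or "!" (\\1 is the captured character)
  spaced <- gsub("([.!])", " \\1", x)
  # Split on literal single spaces
  unlist(strsplit(spaced, " ", fixed = TRUE))
}

tokens_gsub <- split_keep_punct("Caracas. Montevideo! Chicago.")
print(tokens_gsub)
```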
So, my question is: is there another way to implement this by configuring the parameters of the strsplit
function, or another approach that you consider more efficient?
Any help or suggestion is appreciated.
Upvotes: 3
Views: 276
Reputation: 329
eddi's solution doesn't split on the whitespace, so some elements keep a leading space. Try this:
> regmatches(mytext, gregexpr(text=mytext, pattern="(?=[\\.\\!])|(?:\\s)", perl=TRUE), invert=TRUE)
[[1]]
[1] "Caracas" "." "Montevideo" "!" "Chicago" "."
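A related idea, sketched here as an alternative rather than as part of the answer above, is to extract the tokens directly instead of describing the separators. The pattern below is an assumption on my part: it treats a token as either a run of characters that are neither whitespace nor "." or "!", or a single "." or "!".

```r
mytext <- "Caracas. Montevideo! Chicago."

# A token is a run of non-space, non-punctuation characters,
# or a single "." or "!" on its own.
token_pattern <- "[^[:space:].!]+|[.!]"

# gregexpr finds every match; regmatches returns the matched substrings
tokens_extract <- regmatches(mytext, gregexpr(token_pattern, mytext))[[1]]
print(tokens_extract)
```

Because the whitespace is never matched, nothing needs to be trimmed or filtered afterwards.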
Upvotes: 1
Reputation: 49448
Look-ahead is what you're looking for here:
strsplit(mytext, split = "(?=(\\.|!))", perl = TRUE)
#[[1]]
#[1] "Caracas" "." " Montevideo" "!" " Chicago" "."
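The look-ahead leaves a leading space on some elements, as the output above shows. One possible way to get rid of it, assuming R 3.2.0 or later where trimws is available, is to trim each piece after splitting. The character class [.!] used here is just a shorter spelling of the same look-ahead.

```r
mytext <- "Caracas. Montevideo! Chicago."

# Split in front of every "." or "!" using a zero-width look-ahead
parts <- strsplit(mytext, split = "(?=[.!])", perl = TRUE)[[1]]

# Remove the leading/trailing whitespace left on each piece
tokens_trimmed <- trimws(parts)
print(tokens_trimmed)
```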
Upvotes: 3