Reputation: 109874
I need to split on words and end marks (certain types of punctuation). Oddly, the pipe ("|") can count as an end mark. I have code that works on end marks until I try to add the pipe. Adding the pipe makes strsplit split on every character. Escaping it causes an error. How can I include the pipe in the regular expression?
x <- "I like the dog|."
strsplit(x, "[[:space:]]|(?=[.!?*-])", perl=TRUE)
#[[1]]
#[1] "I" "like" "the" "dog|" "."
strsplit(x, "[[:space:]]|(?=[.!?*-\|])", perl=TRUE)
#Error: '\|' is an unrecognized escape in character string starting "[[:space:]]|(?=[.!?*-\|"
The outcome I'd like:
#[[1]]
#[1] "I" "like" "the" "dog" "|" "." #pipe is an element
Upvotes: 17
Views: 38814
Reputation: 176668
One way to solve this is to use the \Q...\E notation to remove the special meaning of any of the characters in "...". As it says in ?regex:
If you want to remove the special meaning from a sequence of characters, you can do so by putting them between ‘\Q’ and ‘\E’. This is different from Perl in that ‘$’ and ‘@’ are handled as literals in ‘\Q...\E’ sequences in PCRE, whereas in Perl, ‘$’ and ‘@’ cause variable interpolation.
For example:
> strsplit(x, "[[:space:]]|(?=[\\Q.!?*-|\\E])", perl=TRUE)
[[1]]
[1] "I" "like" "the" "dog" "|" "."
Upvotes: 19
Reputation: 193547
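The problem is actually your hyphen, which should come either first or last in the character class. Anywhere else it defines a range, so *-| matches every character from * to |, letters included, which is why the split happened on every character. Any of these: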
strsplit(x, "[[:space:]]|(?=[|.!?*-])", perl=TRUE)
strsplit(x, "[[:space:]]|(?=[.|!?*-])", perl=TRUE)
strsplit(x, "[[:space:]]|(?=[.!|?*-])", perl=TRUE)
strsplit(x, "[[:space:]]|(?=[-|.!?*])", perl=TRUE)
and so on should all give you the output you are looking for.
You can also escape the hyphen if you prefer, but remember to use two backslashes!
strsplit(x, "[[:space:]]|(?=[.!?*\\-|])", perl=TRUE)
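To see why the position of the hyphen matters, here is a minimal sketch that checks the range behaviour directly. The variable names (range_hit, literal_hit, res) are only for illustration, and the pattern with the pipe placed just before a trailing hyphen is one more ordering in the same spirit as those above.

```r
x <- "I like the dog|."

# Code points of the two range endpoints: "*" is 42 and "|" is 124,
# so the range *-| covers the ASCII letters and digits in between.
utf8ToInt("*")
utf8ToInt("|")

# Hyphen between two characters: read as a range, so a plain letter matches.
range_hit <- grepl("[*-|]", "a", perl = TRUE)    # TRUE

# Hyphen last: read as a literal hyphen, so a plain letter does not match.
literal_hit <- grepl("[*|-]", "a", perl = TRUE)  # FALSE

# With the hyphen last, the lookahead only fires before the listed end marks.
res <- strsplit(x, "[[:space:]]|(?=[.!?*|-])", perl = TRUE)[[1]]
print(res)
```

Note also that the original error came from R's string parser rather than from the regex engine: "\|" is not a valid escape in an R string literal, so any backslash meant for the regex has to be written as "\\".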
Upvotes: 12