Reputation: 21400
I have texts that contain quotes, some of which contain punctuation and special characters like arrows. Example:
quotes <- c("He was thinking “my go::d I can't get out here”. So he goes “↑beep beep↑” on the horn, this bloke went “HUh HUh,”")
I'd like to extract just the quotes using regexes. So far I've been toying around with the package stringr; specifically, str_subset() might be relevant, but I'm too inexperienced with regexes.
Any help?
Upvotes: 1
Views: 58
Reputation: 15897
You can do this by using the regex capabilities from the base package:
quotes <- c("He was thinking “my go::d I can't get out here”. So he goes “↑beep beep↑” on the horn, this bloke went “HUh HUh,”")
pattern <- "“[^”]*”"
matches <- gregexpr(pattern, quotes)
regmatches(quotes, matches)
## [[1]]
## [1] "“my go::d I can't get out here”" "“↑beep beep↑”"
## [3] "“HUh HUh,”"
The function gregexpr() looks for all occurrences of the pattern inside quotes. The function regmatches() can then be used to extract the actual text that has been matched.
The pattern matches the start and end quotes and any characters in between except for an end quote. Excluding the end quotes is achieved using [^”], which matches any character except for ”.
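If only the quoted text is wanted, without the surrounding quotation marks, the matches can be trimmed afterwards. The following is a minimal sketch in base R, not part of the original answer: the variable names found and inner are made up for illustration, and the curly quotes and arrows are written as \u escapes here so the snippet does not depend on the file encoding.

```r
# Same example text as above, with “ ” ↑ written as Unicode escapes
quotes <- "He was thinking \u201cmy go::d I can't get out here\u201d. So he goes \u201c\u2191beep beep\u2191\u201d on the horn, this bloke went \u201cHUh HUh,\u201d"

# Opening quote, any run of characters that are not a closing quote, closing quote
pattern <- "\u201c[^\u201d]*\u201d"

# All matches for the single input string
found <- regmatches(quotes, gregexpr(pattern, quotes))[[1]]

# Drop the first and last character (the quotation marks) of every match
inner <- substr(found, 2, nchar(found) - 1)
print(inner)
```

If stringr is installed, str_extract_all(quotes, pattern) should give the same matches as the gregexpr()/regmatches() pair, which may be closer to what the question was aiming for.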
Two additional remarks:
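Simply using “.*” as the pattern does not work because the matching is greedy. This pattern would match everything from the first start quote until the last end quote.
If the curly quotes are hard to type, the same pattern can be written with Unicode escapes:
pattern <- "\u201c[^\u201d]*\u201d"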
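The effect of greedy matching can be checked directly. The sketch below is an illustration rather than part of the original answer: it compares the greedy pattern, the negated character class, and a lazy quantifier (.*?) as a possible alternative. The variable names greedy, negated and lazy are hypothetical, and perl = TRUE is an assumption made here so that the lazy quantifier is handled by the PCRE engine.

```r
quotes <- "He was thinking \u201cmy go::d I can't get out here\u201d. So he goes \u201c\u2191beep beep\u2191\u201d on the horn, this bloke went \u201cHUh HUh,\u201d"

# Greedy: .* runs on to the last closing quote in the string
greedy <- regmatches(quotes, gregexpr("\u201c.*\u201d", quotes))[[1]]

# Negated class: the match cannot run past a closing quote
negated <- regmatches(quotes, gregexpr("\u201c[^\u201d]*\u201d", quotes))[[1]]

# Lazy: .*? stops at the first closing quote it can reach
lazy <- regmatches(quotes, gregexpr("\u201c.*?\u201d", quotes, perl = TRUE))[[1]]

length(greedy)   # 1: one long match spanning all three quotes
length(negated)  # 3: one match per quote
all(negated == lazy)
```

For this input the lazy pattern and the negated class should agree; the negated class has the small advantage of not relying on any particular regex engine.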
Upvotes: 1