Chris Ruehlemann

Reputation: 21400

Matching quotes with special characters in R

I have texts that contain quotes, some of which contain punctuation and special characters like arrows. Example:

 quotes <- c("He was thinking “my go::d I can't get out here”. So he goes “↑beep beep↑” on the horn, this bloke went “HUh HUh,”")

I'd like to extract just the quotes using regexes. So far I've been toying around with the package stringr; specifically str_subset() might be relevant but I'm too inexperienced with regexes. Any help?

Upvotes: 1

Views: 58

Answers (1)

Stibu

Reputation: 15897

You can do this by using the regex capabilities from the base package:

quotes <- c("He was thinking “my go::d I can't get out here”. So he goes “↑beep beep↑” on the horn, this bloke went “HUh HUh,”")

pattern <- "“[^”]*”"
matches <- gregexpr(pattern, quotes)
regmatches(quotes, matches)
## [[1]]
## [1] "“my go::d I can't get out here”" "“↑beep beep↑”"
## [3] "“HUh HUh,”"

The function gregexpr() looks for all occurrences of the pattern in the character vector quotes. The function regmatches() can then be used to extract the actual text that has been matched.
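
To make the division of labour between the two functions concrete, here is a minimal sketch that cuts the matches out by hand from what gregexpr() returns. The variable names (txt, m, starts, lens, by_hand) are illustrative, and the curly quotes and arrows are written as Unicode escapes only so that the snippet does not depend on the file encoding.

```r
# Same example text, with curly quotes and arrows written as Unicode escapes
txt <- "He was thinking \u201cmy go::d I can't get out here\u201d. So he goes \u201c\u2191beep beep\u2191\u201d on the horn, this bloke went \u201cHUh HUh,\u201d"

m <- gregexpr("\u201c[^\u201d]*\u201d", txt)

# gregexpr() returns one integer vector per input string:
# the start position of each match, with the match lengths as an attribute
starts <- as.integer(m[[1]])
lens   <- attr(m[[1]], "match.length")

# Cutting the substrings out by hand should agree with regmatches()
by_hand <- substring(txt, starts, starts + lens - 1)
print(by_hand)
print(identical(by_hand, regmatches(txt, m)[[1]]))
```

In practice regmatches() is the more convenient route; the manual version is only meant to show what information gregexpr() provides.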

The pattern matches a start quote, then any number of characters that are not an end quote, then an end quote. Excluding the end quote is achieved using [^”], which matches any single character except for ”.
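
Because the pattern includes both quotation marks, each match comes back with its marks attached. If only the text between them is wanted, which is an assumption about the goal rather than something stated in the question, one possible sketch is to drop the first and last character of every match. The names txt, with_marks and inner are illustrative.

```r
# Shortened example text, with curly quotes and arrows as Unicode escapes
txt <- "So he goes \u201c\u2191beep beep\u2191\u201d on the horn, this bloke went \u201cHUh HUh,\u201d"

with_marks <- regmatches(txt, gregexpr("\u201c[^\u201d]*\u201d", txt))[[1]]

# Each match starts with the opening mark and ends with the closing mark,
# so trimming one character from each side leaves the quoted text
inner <- substring(with_marks, 2, nchar(with_marks) - 1)
print(inner)
```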

Two additional remarks:

  • You cannot use the pattern “.*” because the matching is greedy. This pattern would match everything from the first start quote to the last end quote.
  • You can also express the pattern using Unicode code points: pattern <- "\u201c[^\u201d]*\u201d"
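
Both remarks can be checked with a short sketch on a smaller string. The helper extract() and the variable names are hypothetical, introduced only for this illustration, and all patterns are written with Unicode code points. The lazy quantifier .*? is a further alternative that the remarks do not mention; it is included here only for comparison.

```r
txt <- "He said \u201cone\u201d and then \u201ctwo\u201d."

# Hypothetical helper: all matches of a pattern in a single string
extract <- function(pattern, x) regmatches(x, gregexpr(pattern, x))[[1]]

greedy  <- extract("\u201c.*\u201d", txt)          # greedy: runs to the last end quote
negated <- extract("\u201c[^\u201d]*\u201d", txt)  # negated class: stops at the first end quote
lazy    <- extract("\u201c.*?\u201d", txt)         # lazy quantifier: shortest possible match

print(length(greedy))            # number of matches found by the greedy pattern
print(negated)                   # matches found by the negated-class pattern
print(identical(negated, lazy))  # do the two non-greedy approaches agree?
```

The negated character class has the advantage of stating the intent directly: a quote ends at the first end quote, whatever else it contains.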

Upvotes: 1
