pheeper

Reputation: 1527

R regex remove apostrophes NOT between letters

I'm able to remove all punctuation from a string while keeping apostrophes, but I'm now stuck on how to remove any apostrophes that are not between two letters.

str1 <- "I don't know 'how' to remove these ' things"

Should look like this:

"I don't know how to remove these things"

Upvotes: 2

Views: 2237

Answers (3)

Wiktor Stribiżew

Reputation: 627292

You may use a regex approach:

str1 <- "I don't know 'how' to remove these ' things"
gsub("\\s*'\\B|\\B'\\s*", "", str1)

See this IDEONE demo and a regex demo.

The regex matches:

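  • \\s*'\\B - zero or more whitespace characters, then a ' that is not followed by a word character (a non-word boundary after it)
  • | - or
  • \\B'\\s* - a ' that is not preceded by a word character (a non-word boundary before it), then zero or more whitespace characters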
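As a quick check of the pattern described above, here is a small sketch that applies it to a character vector. The first string is the one from the question; the second is a made-up case for illustration. perl = TRUE is added here as an assumption so that the PCRE engine is used; the call in the answer relies on the default engine.

```r
# Pattern from the answer: an apostrophe with a non-word boundary on
# either side, plus any whitespace on the outer side of it.
pattern <- "\\s*'\\B|\\B'\\s*"

x <- c(
  "I don't know 'how' to remove these ' things",  # string from the question
  "it's 'quoted'"                                 # made-up extra case
)

# gsub() is vectorised, so both strings are cleaned in one call.
res <- gsub(pattern, "", x, perl = TRUE)
res
# [1] "I don't know how to remove these things"
# [2] "it's quoted"
```

Note that the standalone apostrophe is removed together with the space in front of it, which is why no double space is left behind.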

If you do not mind the extra whitespace that can remain after removing a standalone ', you can use a PCRE regex such as

\b'\b(*SKIP)(*F)|'

See the regex demo

Explanation:

  • \b'\b - match a ' in-between word characters
  • (*SKIP)(*F) - and omit the match
  • | - or match...
  • ' - an apostrophe in another context.

See an IDEONE demo:

gsub("\\b'\\b(*SKIP)(*F)|'", "", str1, perl=TRUE)
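A short sketch of what this call returns for the string in the question, with an optional cleanup step. The second gsub() call is an addition for illustration and is not part of the answer.

```r
str1 <- "I don't know 'how' to remove these ' things"

# Apostrophes between two word characters are matched and then skipped;
# every other apostrophe is removed.
res <- gsub("\\b'\\b(*SKIP)(*F)|'", "", str1, perl = TRUE)
res
# [1] "I don't know how to remove these  things"   (two spaces after "these")

# Optional cleanup (an assumption, not in the answer): collapse space runs.
cleaned <- gsub(" {2,}", " ", res)
cleaned
# [1] "I don't know how to remove these things"
```

Because \b is defined in terms of word characters, an apostrophe between digits or underscores is kept as well, not only one between letters.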

To account for apostrophes in-between Unicode letters, add the (*UTF)(*UCP) verbs at the start of the pattern and pass the perl=TRUE argument:

gsub("(*UTF)(*UCP)\\s*'\\B|\\B'\\s*", "", str1, perl=TRUE)
      ^^^^^^^^^^^^                              ^^^^^^^^^     

Or

gsub("(*UTF)(*UCP)\\b'\\b(*SKIP)(*F)|'", "", str1, perl=TRUE) 
      ^^^^^^^^^^^^                                 
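A minimal sketch of the Unicode variant on a string with accented letters. The test string is hypothetical and is written with \u escapes so that the source stays ASCII.

```r
# Hypothetical input: "l'été 'chaud'" (é written as \u00e9).
x <- "l'\u00e9t\u00e9 'chaud'"

# With (*UCP), \b treats the accented letter as a word character, so the
# apostrophe in "l'été" counts as being between two word characters.
res_utf <- gsub("(*UTF)(*UCP)\\b'\\b(*SKIP)(*F)|'", "", x, perl = TRUE)
res_utf
# [1] "l'été chaud"
```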

See another IDEONE demo

Upvotes: 5

lmo

Reputation: 38510

This method using gsub works:

gsub("(([^A-Za-z])'|'([^A-Za-z]))", "\\2 ", str1)

"I don't know  how to remove these   things"

A second pass is needed to collapse the extra spaces:

gsub("  +", " ", gsub("(([^A-Za-z])'|'([^A-Za-z]))", "\\2 ", str1))

  • [^A-Za-z] matches any single non-alphabetic character
  • | is the alternation ("or") operator
  • () captures the matched sub-expression
  • \\2 is a back-reference that returns the second captured sub-expression
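The pattern above needs a character next to the apostrophe, so an apostrophe at the very start or end of the string is left alone, and a non-letter that follows an apostrophe is replaced by a space. One possible variation, sketched here with a hypothetical helper name, anchors both alternatives and writes both capture groups back so that only the apostrophe is dropped. perl = TRUE and the second test string are assumptions made for this illustration.

```r
str1 <- "I don't know 'how' to remove these ' things"

# Hypothetical helper, not part of the answer above.
remove_stray_apostrophes <- function(x) {
  # Group 1: start of string or a non-letter before the apostrophe.
  # Group 2: a non-letter or end of string after the apostrophe.
  # Writing both groups back keeps the neighbouring character; the group
  # that did not take part in the match contributes an empty string.
  x <- gsub("(^|[^A-Za-z])'|'([^A-Za-z]|$)", "\\1\\2", x, perl = TRUE)
  # Second pass, as in the answer: collapse runs of spaces.
  gsub("  +", " ", x)
}

remove_stray_apostrophes(str1)
# [1] "I don't know how to remove these things"
remove_stray_apostrophes("'how', he said")
# [1] "how, he said"
```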

Upvotes: 4

Tyler Rinker

Reputation: 109994

Here's one approach using lookarounds in base R:

gsub("(?<![a-zA-Z])(')|(')(?![a-zA-Z])", "", str1, perl=TRUE)
## [1] "I don't know how to remove these  things"
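The lookaround call leaves a double space where the standalone apostrophe used to be. A sketch of one way to wrap it with a whitespace cleanup follows; the function name is hypothetical, and the capture groups are dropped because the replacement does not use them.

```r
str1 <- "I don't know 'how' to remove these ' things"

# Hypothetical wrapper around the lookaround pattern from the answer.
strip_stray_apostrophes <- function(x) {
  # Remove an apostrophe that lacks an ASCII letter on at least one side.
  x <- gsub("(?<![a-zA-Z])'|'(?![a-zA-Z])", "", x, perl = TRUE)
  # Collapse whitespace runs, then trim leading and trailing whitespace.
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}

strip_stray_apostrophes(str1)
# [1] "I don't know how to remove these things"
```

Unlike the \b-based patterns, the letter classes here do not treat digits as letters, so the apostrophe in "2'3" is removed.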


Upvotes: 3
