Karl Wolfschtagg

Reputation: 567

Apostrophes and regular expressions; Cleaning text in R

I am working on cleaning a large collection of text. My process thus far is:

My issue is in the next-to-last step. Originally, my code was:

gsub(pattern = "\\b\\S\\b", replacement = "", perl = TRUE)

but this wrecked the contractions that I had deliberately left in. Then I tried

gsub(pattern = "\\b(\\S^'\\s)\\b", replacement = "", perl = TRUE)

but this left a lot of single characters.

Then I realized that I needed to keep three single-letter words: "A", "I", and "O" (either case).

Any suggestions?

Upvotes: 2

Views: 212

Answers (1)

Wiktor Stribiżew

Reputation: 627292

You can use

gsub("(?i)\\b(?<!')(?![AOI])\\p{L}\\b", "", x, perl=TRUE)

Details:

  • (?i) - case insensitive matching on
  • \b - a word boundary
  • (?<!') - no ' is allowed immediately on the left
  • (?![AOI]) - the next char cannot be A, I, or O (in either case, because of (?i))
  • \p{L} - any Unicode letter
  • \b - a word boundary
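
As a quick check, here is a minimal sketch of the pattern applied to a small made-up vector. The sample strings and the whitespace clean-up at the end are assumptions for illustration, not part of the original question or answer; `x` stands in for your own text.

```r
# Hypothetical sample input; replace with your own character vector.
x <- c("I don't know a b c", "O say, it's x y z")

# Pattern from the answer: a single letter between word boundaries,
# not preceded by an apostrophe, and not A, I, or O (either case).
pattern <- "(?i)\\b(?<!')(?![AOI])\\p{L}\\b"
cleaned <- gsub(pattern, "", x, perl = TRUE)

# Added step (an assumption): collapse the spaces left behind by the
# removed letters and trim the ends.
cleaned <- trimws(gsub("\\s+", " ", cleaned))

print(cleaned)
```

The "t" in "don't" and the "s" in "it's" survive because the lookbehind rejects any letter that directly follows an apostrophe, while "I", "a", and "O" survive because of the negative lookahead.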

Upvotes: 1
