Reputation: 567
I am working on cleaning a large collection of text. My process thus far is:
My issue is in the next-to-last step. Originally, my code was:
gsub(pattern = "\\b\\S\\b", replacement = "", perl = TRUE)
but this wrecked the contractions that I had left in on purpose. Then I tried
gsub(pattern = "\\b(\\S^'\\s)\\b", replacement = "", perl = TRUE)
but this left a lot of single characters.
Then I realized that I needed to keep three single-letter words: "A", "I", and "O" (either case).
Any suggestions?
Upvotes: 2
Views: 212
Reputation: 627292
You can use
gsub("(?i)\\b(?<!')(?![AOI])\\p{L}\\b", "", x, perl=TRUE)
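To see the effect of this call, here is a minimal sketch on a made-up character vector. The name `x` and its contents are assumptions for illustration, not data from the question.

```r
# Hypothetical sample input: contractions, the words to keep ("I", "a", "O"),
# and stray single letters ("b", "c", "x") that should be removed.
x <- c("I don't see a b or c", "it's an x", "O say can you see")

# Same pattern as above: a lone letter between word boundaries that is
# not preceded by an apostrophe and is not A, O, or I (either case).
cleaned <- gsub("(?i)\\b(?<!')(?![AOI])\\p{L}\\b", "", x, perl = TRUE)

# "b", "c", and "x" are dropped; "I", "a", "O", "don't", and "it's" survive.
print(cleaned)
```

The spaces around each removed letter stay in place, so a later whitespace-collapsing step is probably still needed.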
Details:
(?i) - case insensitive matching on
\b - a word boundary
(?<!') - no ' is allowed immediately on the left
(?![AOI]) - the next char cannot be A, I, or O
\p{L} - any Unicode letter
\b - a word boundary
Upvotes: 1