Reputation: 1731
I'm cleaning a text and I'd like to remove every apostrophe except those preceded and followed by letters, as in: i'm, i'll, he's, etc.
I have the following preliminary solution, which handles many cases, but I want a better one:
rmAps <- function(x) gsub("^\'+| \'+|\'+ |[^[:alpha:]]\'+(a-z)*|\\b\'*$", " ", x)
rmAps("'i'm '' ' 'we end' '")
[1] " i'm we end "
I also tried:
(?<![a-z])'(?![a-z])
But I think I am still missing something.
Upvotes: 1
Views: 315
Reputation: 17621
gsub("'(?!\\w)|(?<!\\w)'", "", x, perl = TRUE)
#[1] "i'm   we end "
Remove occurrences where the apostrophe is not followed by a word character: '(?!\\w).
Remove occurrences where the apostrophe is not preceded by a word character: (?<!\\w)'.
If either of those situations occurs, you want to remove it, so '(?!\\w)|(?<!\\w)' should do the trick. Just note that \\w includes the underscore, and adjust as necessary.
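If the underscore (and digits) should not count as "letters", one possible adjustment is to swap \\w for a letter class. The sketch below assumes the POSIX class [[:alpha:]] is what you want; the helper name rm_aps_letters is made up for illustration.

```r
# Hypothetical helper: keep an apostrophe only when it sits between two letters.
# [[:alpha:]] replaces \\w, so "_" and digits no longer protect an apostrophe.
rm_aps_letters <- function(x) {
  gsub("'(?![[:alpha:]])|(?<![[:alpha:]])'", "", x, perl = TRUE)
}

x <- "'i'm '' ' 'we end' '"
rm_aps_letters(x)

# Apostrophes next to an underscore or a digit are removed here,
# whereas the \\w version would keep them.
rm_aps_letters("a_'b 9'9 he's")
```

Whether digits should protect an apostrophe (as in 9'9) depends on your data, so pick the class accordingly.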
Another option is
gsub("\\w'\\w(*SKIP)(*FAIL)|'", "", x, perl = TRUE)
In this case, you match any instances when ' is surrounded by word characters: \\w'\\w, and then force that match to fail with (*SKIP)(*FAIL). But, also look for ' using |'. The result is that only occurrences of ' not wrapped in word characters will be matched and substituted out.
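A quick side-by-side of the two patterns, as a sketch rather than a full test suite. On the string from the question they agree. The second input is my own example, chosen because \\w'\\w consumes the letter after the apostrophe, so a second apostrophe right after that letter is no longer "wrapped" when matching resumes.

```r
x <- "'i'm '' ' 'we end' '"

lookaround <- function(s) gsub("'(?!\\w)|(?<!\\w)'", "", s, perl = TRUE)
skip_fail  <- function(s) gsub("\\w'\\w(*SKIP)(*FAIL)|'", "", s, perl = TRUE)

# Same result on the question's input.
identical(lookaround(x), skip_fail(x))

# Apostrophes separated by a single letter: the patterns can differ,
# because (*SKIP) resumes matching after the consumed "k'n".
lookaround("rock'n'roll")
skip_fail("rock'n'roll")
```

If inputs like this matter, the lookaround version is the safer choice, since lookarounds do not consume the neighbouring letters.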
Upvotes: 2
Reputation: 16089
You can use the following regular expression:
(?<=\w)'(?=\w)
(?<=) is a positive lookbehind. Everything inside needs to match before the next selector.
(?=) is a positive lookahead. Everything inside needs to match after the previous selector.
\w: any alphanumeric character and the underscore.
You could also switch \w to e.g. [a-zA-Z] if you want to restrict the results.
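Note that this pattern matches the apostrophes you want to keep, not the ones to remove. A minimal R sketch of how it could be used to locate them, assuming the [a-zA-Z] variant and perl = TRUE (base R needs that for lookarounds, and backslashes would have to be doubled inside an R string):

```r
x <- "'i'm '' ' 'we end' '"

# Positions of apostrophes that have a letter on both sides.
keep <- gregexpr("(?<=[a-zA-Z])'(?=[a-zA-Z])", x, perl = TRUE)
as.integer(keep[[1]])

# The matched text itself: just the apostrophe, since lookarounds are zero-width.
regmatches(x, keep)[[1]]
```

To delete everything else, you would negate the condition, which leads to an alternation of negative lookarounds such as '(?![a-zA-Z])|(?<![a-zA-Z])'.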
→ Here is your example on regex101 for live testing.
Upvotes: 1