Reputation: 530
I've been searching for a solution on Stack Overflow and experimenting in R (RStudio) for hours. I know how to remove punctuation with gsub while preserving apostrophes, intra-word dashes and intra-word &'s (for AT&T). I'm not using the tm package, but I'd welcome a tip from anyone who knows how to do this with tm as well. What I'd like to know is how to keep gsub, or any other regular-expression approach, from concatenating the words on either side of the punctuation it removes. So far, this is the best I can do:
x <-"Good luck!!!!SPRINT I like good deals. I can't lie brand-new stuff---- excites me got&&&&& to say yo, At&t why? a* dash-- apostrophe's''' I can do all-day. But preventing%%%%concatenating is a new**$ballgame but----why--- not?"
gsub("(\\w['&-]\\w)|[[:punct:]]", "\\1", x, perl=TRUE)
#[1] "Good luckSPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventingconcatenating is a newballgame butwhy not"
Any ideas? By the way, the goal is to apply the solution to a data frame column or a corpus of social media posts.
Upvotes: 2
Views: 1456
Reputation: 70732
A single gsub call gets you most of the way there, leaving only leading/trailing whitespace to deal with:
gsub("[[:punct:]]* *(\\w+[&'-]\\w+)|[[:punct:]]+ *| {2,}", " \\1", x)
# [1] "Good luck SPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventing concatenating is a new ballgame but why not "
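The trailing space left in the output above can be dropped with base R's trimws() (available since R 3.2.0). A minimal sketch, assuming a shortened sample string; the names x_small and cleaned are only for illustration:

```r
# Shortened sample input (hypothetical, not the full string from the question)
x_small <- "Good luck!!!!SPRINT I can't lie brand-new stuff---- At&t why?"

# Same pattern as above: keep intra-word & ' - and replace other punctuation runs with one space
cleaned <- gsub("[[:punct:]]* *(\\w+[&'-]\\w+)|[[:punct:]]+ *| {2,}", " \\1", x_small)

# Remove the leftover leading/trailing whitespace
cleaned <- trimws(cleaned)
cleaned
```

Wrapping the gsub() call in trimws() directly works just as well if you prefer a one-liner.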
If you're able to use the qdapRegex package, you could do:
library(qdapRegex)
rm_default(x, pattern = "[^ a-zA-Z&'-]|[&'-]{2,}", replacement = " ")
# [1] "Good luck SPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventing concatenating is a new ballgame but why not"
Upvotes: 2
Reputation: 6511
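You could match [-'&] only when it is preceded or followed by a non-word boundary \B, so that intra-word occurrences (can't, brand-new, At&t) are left alone, and use the negated class [^-'&[:^punct:]] to match all other punctuation, i.e. the POSIX class [:punct:] with -'& excluded.
Regex:
\s*(?:(?:\B[-'&]+|[-'&]+\B|[^-'&[:^punct:]]+)\s*)+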
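To see what [^-'&[:^punct:]] matches on its own, it can help to test it against single characters. This is only a quick check, assuming perl = TRUE (the negated POSIX class [:^punct:] is a PCRE feature); the vector chars is made up for the illustration:

```r
# A few sample characters: the three protected ones, some other punctuation, a letter, a space
chars <- c("-", "'", "&", "!", "%", "$", "a", " ")

# TRUE where the character is punctuation other than - ' &
matches <- grepl("[^-'&[:^punct:]]", chars, perl = TRUE)
names(matches) <- chars
matches
```

Only "!", "%" and "$" should come back TRUE: the class is a double negative, "anything that is not -, ', & or a non-punctuation character".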
Replacement:
" " (1 space)
Code:
x <-"Good luck!!!!SPRINT I like good deals. I can't lie brand-new stuff---- excites me got&&&&& to say yo, At&t why? a* dash-- apostrophe's''' I can do all-day. But preventing%%%%concatenating is a new**$ballgame but----why--- not?"
gsub("\\s*(?:(?:\\B[-'&]+|[-'&]+\\B|[^-'&[:^punct:]]+)\\s*)+", " ", x, perl=TRUE)
#[1] "Good luck SPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventing concatenating is a new ballgame but why not "
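Since the question mentions a data frame column, here is one way that could look. It is a sketch rather than a definitive implementation: the helper clean_punct and the data frame posts (with its text column) are hypothetical names, and trimws() is added to drop the trailing space seen above.

```r
# Hypothetical helper: wraps the regex above and trims leftover whitespace
clean_punct <- function(text) {
  out <- gsub("\\s*(?:(?:\\B[-'&]+|[-'&]+\\B|[^-'&[:^punct:]]+)\\s*)+",
              " ", text, perl = TRUE)
  trimws(out)
}

# Hypothetical data frame of posts; gsub is vectorised, so the whole column is handled at once
posts <- data.frame(
  text = c("Good luck!!!!SPRINT",
           "brand-new stuff---- excites me",
           "At&t why?"),
  stringsAsFactors = FALSE
)
posts$clean <- clean_punct(posts$text)
posts$clean
```

Because gsub() accepts a character vector, no loop or sapply() is needed for a single column.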
Upvotes: 0