Robert

Reputation: 530

Removing punctuation except for apostrophes AND intra-word dashes with gsub in R WITHOUT accidentally concatenating two words

I've been searching for a solution on Stack Overflow and experimenting in R (RStudio) for hours. I know how to remove punctuation with gsub while preserving apostrophes, intra-word dashes, and intra-word &'s (as in AT&T). I'm not using the tm package, but I'd welcome a tip from someone on how to do this with tm in tandem with the issue below. What I would like to know is how to keep gsub, or any other regular-expression approach, from concatenating words at the spots where the removed punctuation used to be. So far, this is the best I can do:

x <- "Good luck!!!!SPRINT I like good deals. I can't lie brand-new stuff---- excites me got&&&&& to say yo, At&t why? a* dash-- apostrophe's''' I can do all-day. But preventing%%%%concatenating  is a new**$ballgame but----why--- not?"

gsub("(\\w['&-]\\w)|[[:punct:]]", "\\1", x, perl=TRUE) 

#[1] "Good luckSPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventingconcatenating  is a newballgame butwhy not"

Any ideas? By the way, the purpose of this question is to apply the solution to a data frame column or a corpus of social media posts.

Upvotes: 2

Views: 1456

Answers (2)

hwnd

Reputation: 70732

You can get as far as leaving only leading/trailing whitespace with a single function call:

gsub("[[:punct:]]* *(\\w+[&'-]\\w+)|[[:punct:]]+ *| {2,}", " \\1", x)
# [1] "Good luck SPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventing concatenating is a new ballgame but why not "
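
If the leftover leading/trailing whitespace is a problem, one option is to wrap the call in base R's trimws(). This is only a sketch: the helper name clean_punct and the short sample string are made up for illustration, and the pattern is the one shown above.

```r
# Hypothetical helper (not part of the original answer): run the single
# gsub() call, then strip leading/trailing whitespace with trimws().
clean_punct <- function(x) {
  pattern <- "[[:punct:]]* *(\\w+[&'-]\\w+)|[[:punct:]]+ *| {2,}"
  trimws(gsub(pattern, " \\1", x))
}

cleaned <- clean_punct("yo, At&t why?")
cleaned
# [1] "yo At&t why"
```

trimws() is available in base R from version 3.2.0 onward, so no extra package is needed.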

If you're able to use the qdapRegex package, you could do:

library(qdapRegex)
rm_default(x, pattern = "[^ a-zA-Z&'-]|[&'-]{2,}", replacement = " ")
# [1] "Good luck SPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventing concatenating is a new ballgame but why not"

Upvotes: 2

Mariano

Reputation: 6511

You could:

  1. match all spaces before and after each punctuation sign, and use 1 space in the replacement
  2. restrict [-'&] to match only when it is preceded or followed by a non-word boundary \B

Regex:

\s*(?:(?:\B[-'&]+|[-'&]+\B|[^-'&[:^punct:]]+)\s*)+
  • Notice I'm using a double negative in [^-'&[:^punct:]] to exclude -'& from the POSIX class [:punct:]
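
To see what the two building blocks of this pattern do in isolation, here is a small sketch. The sample inputs and variable names are made up for illustration; only the regex fragments come from the pattern above.

```r
# Sample inputs are illustrative, not taken from the original post.

# 1. Double-negative class: matches punctuation other than - ' &
chars <- c("-", "'", "&", "!", "?", "a", " ")
is_other_punct <- grepl("[^-'&[:^punct:]]", chars, perl = TRUE)
is_other_punct
# [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE

# 2. \B restriction: - ' & are removed only when they are not
#    sandwiched between two word characters
edges_removed <- gsub("\\B[-'&]+|[-'&]+\\B", "",
                      "brand-new stuff---- can't apostrophe's'''",
                      perl = TRUE)
edges_removed
# [1] "brand-new stuff can't apostrophe's"
```

Both fragments rely on PCRE features, so perl = TRUE is needed.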

Replacement:

" "   (1 space)

Code:

x <- "Good luck!!!!SPRINT I like good deals. I can't lie brand-new stuff---- excites me got&&&&& to say yo, At&t why? a* dash-- apostrophe's''' I can do all-day. But preventing%%%%concatenating  is a new**$ballgame but----why--- not?"

gsub("\\s*(?:(?:\\B[-'&]+|[-'&]+\\B|[^-'&[:^punct:]]+)\\s*)+", " ", x, perl=TRUE)

#[1] "Good luck SPRINT I like good deals I can't lie brand-new stuff excites me got to say yo At&t why a dash apostrophe's I can do all-day But preventing concatenating  is a new ballgame but why not "
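
Since the question mentions a data frame column, here is a rough sketch of how the same call could be applied to one. gsub() is vectorized over its input, so no loop is needed. The data frame posts and its columns text and clean are hypothetical names, and trimws() is added here only to drop the trailing space.

```r
# Hypothetical data frame; the names posts, text and clean are assumptions.
posts <- data.frame(
  text = c("Good luck!!!!SPRINT",
           "brand-new stuff---- excites me",
           "At&t why?"),
  stringsAsFactors = FALSE
)

# Same pattern as in the answer above.
pattern <- "\\s*(?:(?:\\B[-'&]+|[-'&]+\\B|[^-'&[:^punct:]]+)\\s*)+"

# gsub() processes every element of the column in one call.
posts$clean <- trimws(gsub(pattern, " ", posts$text, perl = TRUE))
posts$clean
# [1] "Good luck SPRINT"           "brand-new stuff excites me"
# [3] "At&t why"
```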

Upvotes: 0
