agstudy

Reputation: 121578

Equivalent regular expression to remove all punctuation

In R, to remove punctuation from a string, I can do this:

x <- 'a#,g:?s!*$t/{u}\\d\\&y'
gsub('[[:punct:]]','',x)
[1] "agstudy"

This works, but it gives me no fine-grained control over which punctuation characters are removed (imagine I want to keep some symbols in my string). How can I rewrite this gsub call more explicitly without forgetting any symbol, something like this:

gsub('[#,:?!*$/{}\\&]','',x,perl=FALSE)

EDIT

The difficulty I encountered is how to write a regular expression (preferably in R) that removes all punctuation characters from x but keeps, for example, only #:

 "a#gstudy"

Upvotes: 3

Views: 676

Answers (5)

ikegami

Reputation: 385897

The straightforward approach is to use a lookahead or a lookbehind to match the same character twice: once to make sure it's punctuation, and once to make sure it's not "#".

(?=[^#])[[:punct:]]

or

(?!#)[[:punct:]]

Lookaheads and lookbehinds are a little expensive, though. Rather than using a lookaround at every position, it's more efficient to use one only when we find a punctuation character.

[[:punct:]](?<!#)

Of course, it's even more efficient to get rid of lookarounds completely. This can be achieved through double-negation.

[^[:^punct:]#]

I haven't tested these with R, but they should at least work with perl=TRUE.
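A quick way to check the four patterns in R is to run each one through gsub with perl=TRUE. This is only a sketch: it assumes the sample string from the question (with the backslashes doubled so that R can parse it), and the names `patterns` and `results` are made up for illustration.

```r
# Sample string from the question; backslashes are doubled for R's parser.
x <- 'a#,g:?s!*$t/{u}\\d\\&y'

# The four patterns above, written as R string literals.
patterns <- c(
  lookahead_positive = '(?=[^#])[[:punct:]]',
  lookahead_negative = '(?!#)[[:punct:]]',
  lookbehind         = '[[:punct:]](?<!#)',
  double_negation    = '[^[:^punct:]#]'
)

# Apply each pattern with the PCRE engine and collect the results.
results <- vapply(patterns, function(p) gsub(p, '', x, perl = TRUE),
                  character(1))
print(results)
```

If the patterns behave as described, every element of `results` should be "a#gstudy".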

Upvotes: 3

A5C1D2H2I1M1N2O1R2T1

Reputation: 193527

This page indicates that the [[:punct:]] class should include:

[-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]

From the R ?regex page, we also get this as verification:

[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~

Thus, you can use that as the basis for creating your own pattern, excluding the characters you want to keep.
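One way to build such a pattern programmatically is sketched below. It assumes perl=TRUE, because in PCRE a backslash in front of any non-alphanumeric character makes that character literal, so every punctuation character can be escaped the same way inside a bracket expression. The names `ascii_punct` and `strip_punct` are hypothetical helpers, not part of R.

```r
# The 32 ASCII punctuation characters listed above, one per element.
ascii_punct <- strsplit("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~", "")[[1]]

# Hypothetical helper: remove every ASCII punctuation character
# except the ones given in `keep`.
strip_punct <- function(s, keep = "") {
  drop <- setdiff(ascii_punct, strsplit(keep, "")[[1]])
  if (length(drop) == 0) return(s)  # everything is kept, nothing to remove
  # Escape each character so it is literal inside the PCRE character class.
  pattern <- paste0("[", paste0("\\", drop, collapse = ""), "]")
  gsub(pattern, "", s, perl = TRUE)
}

x <- 'a#,g:?s!*$t/{u}\\d\\&y'
strip_punct(x, "#")    # "a#gstudy"
strip_punct(x)         # "agstudy"
strip_punct(x, "&#{")  # "a#gst{ud&y"
```

Escaping every character, rather than only the few that are special inside a class, keeps the helper short at the cost of a longer pattern.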


This is messy as heck, especially with two much nicer answers, but I just wanted to show the silliness I had in mind:

Create a function that looks something like this:

newPunks <- function(CHARS) {
  punks <- c("!", "\\\"", "#", "\\$", "%", "&", "'", "\\(", "\\)",
             "\\*", "\\+", ",", "-", "\\.", "/", ":", ";", "<",
             "=", ">", "\\?", "@", "\\[", "\\\\", "\\]", "\\^", "_", 
             "`", "\\{", "\\|", "\\}", "~")
  keepers <- strsplit(CHARS, "")[[1]]
  keepers <- ifelse(keepers %in% c("\"", "$", "{", "}", "(", ")",
                                   "*", "+", ".", "?", "[", "]",
                                   "^", "|", "\\"), paste0("\\", keepers), keepers)
  paste(setdiff(punks, keepers), collapse="|")
}

Usage:

gsub(newPunks("#"), "", x)
# [1] "a#gstudy"
gsub(newPunks(""), "", x)
# [1] "agstudy"
gsub(newPunks("&#{"), "", x)
# [1] "a#gst{ud&y"

Bleah. Time for me to go to bed....

Upvotes: 5

Josh O'Brien

Reputation: 162341

Using a negative lookahead assertion:

x <- 'a#,g:?s!*$t/{u}\\d\\&y'

gsub('(?!#)[[:punct:]]','',x, perl=TRUE)
# [1] "a#gstudy"

This in essence tests each character twice, asking once from the preceding intercharacter space whether the next character is something other than a "#" and then, from the character itself, whether it is a punctuation symbol. If both tests are true, a match is registered and the character is removed.
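The same idea should extend to keeping several symbols by putting a character class inside the lookahead. A minimal sketch, assuming the same x as above; the variable name `kept` is only for illustration:

```r
x <- 'a#,g:?s!*$t/{u}\\d\\&y'

# Keep "#", "&" and "{" while removing every other punctuation character.
kept <- gsub('(?![#&{])[[:punct:]]', '', x, perl = TRUE)
print(kept)
# [1] "a#gst{ud&y"
```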

Upvotes: 8

Casimir et Hippolyte

Reputation: 89557

You can use a negated character class. For example:

\pP is the Unicode character class for punctuation characters.

\PP is all that is not a punctuation character.

[^\PP] is all that is a punctuation character.

[^\PP~] is all that is a punctuation character except tilde.

Note: you can stay in the ASCII range by using \p{PosixPunct}:

[^\P{PosixPunct}~]

or match Unicode punctuation characters, plus the ASCII characters that POSIX treats as punctuation, with \p{XPosixPunct}:

[^\P{XPosixPunct}~]
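For R users, a hedged sketch of the first form with perl=TRUE follows. Two assumptions are worth stating: \p{PosixPunct} and \p{XPosixPunct} are Perl property names and, as far as I know, are not recognized by the PCRE engine behind perl=TRUE, so only \pP is used here; and \pP covers the Unicode P categories only, so ASCII symbols such as $ (category Sc) or ~ (category Sm) are not matched by it in the first place. The variable name `res` is only for illustration.

```r
x <- 'a#,g:?s!*$t/{u}\\d\\&y'

# Remove Unicode punctuation (general category P) except "#".
res <- gsub('[^\\PP#]', '', x, perl = TRUE)
print(res)
# "$" is a currency symbol (Sc), not punctuation (P), so it should survive:
# [1] "a#gs$tudy"
```

If $ and the other ASCII symbols must go as well, the [[:punct:]] based patterns are the safer choice in R.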

Upvotes: 7

Hunter McMillen

Reputation: 61510

It works exactly the same in Perl; [:punct:] is a POSIX character class that simply maps to:

[!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]

The equivalent Perl version would be:

my $x = 'a#,g:?s!*$t/{u}\d\&y';
$x =~ s/[[:punct:]]//g;
print $x;

__END__
agstudy

Upvotes: 3
