Reputation: 121578
In R, to remove punctuation from a string, I can do this:
x <- 'a#,g:?s!*$t/{u}\\d\\&y'
gsub('[[:punct:]]','',x)
[1] "agstudy"
This is smart, but it gives me no fine-grained control over which punctuation characters are removed (imagine I want to keep some symbols in my string). How can I rewrite this gsub call
in a more explicit way without forgetting any symbol, something like this:
gsub('[#,:?!*$/{}\\&]','',x,perl=FALSE)
EDIT
The difficulty I encountered is how to write a regular expression (preferably in R) that removes all punctuation characters from x but keeps #, for example:
"a#gstudy"
Upvotes: 3
Views: 676
Reputation: 385897
The straightforward approach is to use a lookahead or a lookbehind to test the same character twice: once to make sure it's punctuation, and once to make sure it's not "#".
(?=[^#])[[:punct:]]
or
(?!#)[[:punct:]]
Lookaheads and lookbehinds are a little expensive, though. Rather than using a lookaround at every position, it's more efficient to use one only after we have found a punctuation character.
[[:punct:]](?<!#)
Of course, it's even more efficient to get rid of lookarounds completely. This can be achieved through double-negation.
[^[:^punct:]#]
I haven't tested these with R, but they should at least work with perl=TRUE.
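A quick way to check the four patterns above in R is to run each one through gsub with perl=TRUE. This is only a sketch: the variable names are mine, and it assumes R's PCRE engine accepts the negated POSIX class [:^punct:] the same way Perl does.

```r
# Sample string from the question (backslashes doubled for R string syntax)
x <- 'a#,g:?s!*$t/{u}\\d\\&y'

# The four patterns from this answer, written as R string literals
patterns <- c(
  lookahead_class = '(?=[^#])[[:punct:]]',
  neg_lookahead   = '(?!#)[[:punct:]]',
  neg_lookbehind  = '[[:punct:]](?<!#)',
  double_negation = '[^[:^punct:]#]'
)

# Apply each pattern with the PCRE engine and collect the results
results <- vapply(patterns, function(p) gsub(p, '', x, perl = TRUE),
                  character(1))
print(results)
```

If all four behave as described, every element of results should be the string with only "#" left among the punctuation.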
Upvotes: 3
Reputation: 193527
Reading this page indicates that the [[:punct:]] class should include:
[-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]
From the R ?regex page, we also get this as verification:
[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
Thus, you can possibly use that as your basis for creating your own pattern, excluding the characters you want to keep.
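One way to turn that list into a pattern without hand-escaping anything is sketched below. The helper name punct_class_except is hypothetical, and the sketch relies on perl=TRUE, where a backslash before any non-alphanumeric character makes it literal, so every punctuation character can simply be escaped.

```r
# The 32 ASCII punctuation characters listed in ?regex, one per element
punct <- strsplit("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~", "")[[1]]

# Hypothetical helper: build a PCRE character class matching every
# punctuation character except those given in 'keep'
punct_class_except <- function(keep = "") {
  drop <- setdiff(punct, strsplit(keep, "")[[1]])
  # If everything is kept, return a pattern that can never match
  if (length(drop) == 0) return("(?!)")
  # Escape each character with a backslash; PCRE treats these as literals
  paste0("[", paste0("\\", drop, collapse = ""), "]")
}

x <- 'a#,g:?s!*$t/{u}\\d\\&y'
kept_hash <- gsub(punct_class_except("#"), "", x, perl = TRUE)
kept_none <- gsub(punct_class_except(""), "", x, perl = TRUE)
print(c(kept_hash, kept_none))
```

The design choice here is to escape every character rather than only the metacharacters, which trades a slightly longer pattern for not having to remember which characters are special inside a class.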
This is messy as heck especially with two much nicer answers, but I just wanted to show the silliness I had in mind:
Create a function that looks something like this:
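newPunks <- function(CHARS) {
  # Every punctuation character as a regex alternative, escaped where needed
  punks <- c("!", "\\\"", "#", "\\$", "%", "&", "'", "\\(", "\\)",
             "\\*", "\\+", ",", "-", "\\.", "/", ":", ";", "<",
             "=", ">", "\\?", "@", "\\[", "\\\\", "\\]", "\\^", "_",
             "`", "\\{", "\\|", "\\}", "~")
  # Split CHARS into single characters and escape them the same way,
  # so they compare equal to the entries in punks
  keepers <- strsplit(CHARS, "")[[1]]
  keepers <- ifelse(keepers %in% c("\"", "$", "{", "}", "(", ")",
                                   "*", "+", ".", "?", "[", "]",
                                   "^", "|", "\\"), paste0("\\", keepers), keepers)
  # Join whatever is left into one alternation pattern
  paste(setdiff(punks, keepers), collapse="|")
}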
Usage:
gsub(newPunks("#"), "", x)
# [1] "a#gstudy"
gsub(newPunks(""), "", x)
# [1] "agstudy"
gsub(newPunks("&#{"), "", x)
# [1] "a#gst{ud&y"
Bleah. Time for me to go to bed....
Upvotes: 5
Reputation: 162341
Using a negative lookahead assertion:
x <- 'a#,g:?s!*$t/{u}\\d\\&y'
gsub('(?!#)[[:punct:]]','',x, perl=TRUE)
# [1] "a#gstudy"
This in essence tests each character twice, asking once from the preceding intercharacter space whether the next character is something other than a "#", and then, from the character itself, whether it is a punctuation symbol. If both tests are true, a match is registered and the character is removed.
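The same two-step test extends to several kept characters by putting a character class inside the lookahead. A small sketch (the choice of "#", "&" and "{" as keepers is just for illustration):

```r
x <- 'a#,g:?s!*$t/{u}\\d\\&y'

# Test 1: the next character is none of # & {
# Test 2: that character is a punctuation symbol
kept_three <- gsub('(?![#&{])[[:punct:]]', '', x, perl = TRUE)
print(kept_three)
# [1] "a#gst{ud&y"
```

Only the class inside (?!...) changes; the [[:punct:]] part stays the same however many characters you want to keep.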
Upvotes: 8
Reputation: 89557
You can use a negated character class, for example:
\pP is the Unicode character class for punctuation characters.
\PP is everything that is not a punctuation character.
[^\PP] is everything that is a punctuation character.
[^\PP~] is everything that is a punctuation character except tilde.
Note: you can stay in the ASCII range by using \p{PosixPunct}:
[^\P{PosixPunct}~]
or match Unicode punctuation characters while still treating the whole POSIX set as punctuation in the ASCII range, with \p{XPosixPunct}:
[^\P{XPosixPunct}~]
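To see the difference between \pP and the POSIX set in R, here is a hedged sketch. It assumes R's PCRE build has Unicode property support (the usual case), and as far as I know the PosixPunct and XPosixPunct property names are Perl-specific, so only \pP is tried here. Note that "$" belongs to the Unicode symbol categories rather than punctuation, so \pP leaves it alone.

```r
x <- 'a#,g:?s!*$t/{u}\\d\\&y'

# Unicode punctuation (category P) except "#": "$" is a currency
# symbol, not category P, so it is not removed
unicode_only <- gsub('[^\\PP#]', '', x, perl = TRUE)

# POSIX punctuation except "#", for comparison: "$" is removed as well
posix_set <- gsub('(?!#)[[:punct:]]', '', x, perl = TRUE)

print(c(unicode_only, posix_set))
```

So if you want the full ASCII punctuation list handled in R, [[:punct:]] with a lookahead is probably the safer choice; \pP is more useful when non-ASCII punctuation matters.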
Upvotes: 7
Reputation: 61510
It works exactly the same in Perl: [:punct:] is a POSIX character class that simply maps to:
[!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]
The equivalent Perl version would be:
my $x = 'a#,g:?s!*$t/{u}\d\&y';
$x =~ s/[[:punct:]]//g;
print $x;
__END__
agstudy
Upvotes: 3