Reputation: 128
I am using R to tokenize a set of texts; after tokenization I end up with a character vector in which punctuation marks, apostrophes and hyphens are preserved. For instance, I have this original text
txt <- "this ain't a Hewlett-Packard box - it's an Apple box, a very nice one!"
After the tokenization (which I perform using scan_tokenizer from package tm) I get the following character vector
> vec1
[1] "this" "ain't" "a" "Hewlett-Packard"
[5] "box" "-" "it's" "an"
[9] "Apple" "box," "a" "very"
[13] "nice" "one!"
Now in order to get rid of the punctuation marks I do the following
vec2 <- gsub("[^[:alnum:][:space:]']", "", vec1)
That is, I replace everything that is not an alphanumeric character, a space, or an apostrophe with ""; however, this is the result
> vec2
[1] "this" "ain't" "a" "HewlettPackard" "box"
[6] "" "it's" "an" "Apple" "box"
[11] "a" "very" "nice" "one"
I want to preserve hyphenated words such as "Hewlett-Packard", while getting rid of lone hyphens. Basically I need a regex that excludes hyphenated words of the form \w-\w from the gsub expression for vec2.
Your suggestions are much welcome
Upvotes: 4
Views: 1429
Reputation: 14902
I suggest two principles: first, keep it as simple as possible; second, use Unicode character classes whenever possible, especially for characters like hyphens, for which various text processors may substitute other characters (see for instance http://www.fileformat.info/info/unicode/category/Pd/list.htm).
So:
Simplest (and also very fast), a binary match to detect only the hyphens:
vec1[!(vec1 %in% "-")]
Better (from a Unicode standpoint), also pretty fast:
vec1[!stringi::stri_detect_regex(vec1, "^\\p{Pd}$")]
The last one uses the Unicode character class Pd, representing "a dash or hyphen punctuation mark". This includes non-breaking hyphens, em dashes, etc. The ^ and $ at the beginning and end of the regular expression mean that the token must consist of a single such character.
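To illustrate the difference between the two filters, here is a minimal sketch using base R's grepl with perl = TRUE in place of stringi. The tokens vector is made up for illustration, and the sketch assumes the PCRE library behind your R build supports Unicode properties such as \p{Pd}, which is typical but not guaranteed.

```r
# Hypothetical tokens: a plain hyphen, an en dash (U+2013) and an em dash (U+2014)
tokens <- c("box", "-", "\u2013", "\u2014", "Hewlett-Packard")

# Exact match: drops only the ASCII hyphen, so the en and em dash tokens remain
ascii_only <- tokens[!(tokens %in% "-")]

# Unicode class: drops any token that is a single dash-like character
is_lone_dash <- grepl("^\\p{Pd}$", tokens, perl = TRUE)
unicode_aware <- tokens[!is_lone_dash]

print(ascii_only)
print(unicode_aware)
```

If the input is known to contain only ASCII hyphens, the exact match is enough; the Unicode class is mainly useful for text that has passed through word processors.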
Upvotes: 1
Reputation: 263362
If you just want to remove "pure hyphens", then use the pattern '^-$' (outside a bracket expression, the hyphen is not a regex metacharacter, so it needs no escaping).
vec2 <- vec1[!grepl( '^-$' , vec1) ]
If you wanted to remove "naked punctuation of all sorts" it might be:
vec2 <- vec1[!grepl( '^[[:punct:]]$' , vec1) ]
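One possible way to combine this filter with the gsub from the question is sketched below. Two details are my own assumptions rather than part of the answer above: the + quantifier, so that tokens made of several punctuation characters are dropped as well, and the hyphen added at the end of the negated bracket expression, so that hyphens inside words survive the cleanup.

```r
# Token vector as shown in the question
vec1 <- c("this", "ain't", "a", "Hewlett-Packard", "box", "-", "it's", "an",
          "Apple", "box,", "a", "very", "nice", "one!")

# Step 1: drop tokens that consist only of punctuation (e.g. the lone "-")
kept <- vec1[!grepl("^[[:punct:]]+$", vec1)]

# Step 2: strip remaining punctuation, keeping apostrophes and hyphens;
# a "-" placed last inside a bracket expression is treated literally
vec2 <- gsub("[^[:alnum:][:space:]'-]", "", kept)

print(vec2)
```

Filtering first means the cleanup step never sees the lone hyphen, so no empty string is left behind in the result.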
Upvotes: 5
Reputation: 17611
Here's an approach using strsplit with word boundaries (\b) and non-word characters (\W, which is equivalent to [^[:alnum:]_])
strsplit(txt, "\\b | \\b|\\W |\\W$")
#[[1]]
# [1] "this" "ain't" "a" "Hewlett-Packard"
# [5] "box" "" "it's" "an"
# [9] "Apple" "box" "a" "very"
#[13] "nice" "one"
Or to return nothing at all for the lone hyphen instead of "".
strsplit(txt, "\\b | \\b| ?\\W |\\W$")
#[[1]]
# [1] "this" "ain't" "a" "Hewlett-Packard"
# [5] "box" "it's" "an" "Apple"
# [9] "box" "a" "very" "nice"
#[13] "one"
Upvotes: 2
Reputation: 174706
You may try this,
> library(stringr)
> txt <- "this ain't a Hewlett-Packard box - it's an Apple box, a very nice one!"
> gsub("(?!\\b['-]\\b|\\s)[\\W_]", "", str_extract_all(txt, "\\S+")[[1]], perl=T)
[1] "this" "ain't" "a"
[4] "Hewlett-Packard" "box" ""
[7] "it's" "an" "Apple"
[10] "box" "a" "very"
[13] "nice" "one"
or
> strsplit(gsub('(?!\\b[[:punct:]]\\b|\\s)[\\W_]', '', txt,perl=T), ' ')[[1]]
[1] "this" "ain't" "a"
[4] "Hewlett-Packard" "box" ""
[7] "it's" "an" "Apple"
[10] "box" "a" "very"
[13] "nice" "one"
Upvotes: 2
Reputation: 4554
strsplit(gsub("[^[:alnum:][:space:]'-]", "", txt), '\\s|\\ - ')
Upvotes: 2
Reputation: 28441
strsplit(gsub('[[:punct:]](?!\\w)', '', txt, perl=T), ' ')[[1]]
#[1] "this" "ain't" "a"
#[4] "Hewlett-Packard" "box" ""
#[7] "it's" "an" "Apple"
#[10] "box" "a" "very"
#[13] "nice" "one"
Or you can do this to keep the exclamation point after "one":
strsplit(gsub('(?<!\\w)[[:punct:]](?!\\w)', '', txt,perl=T), ' ')[[1]]
# [1] "this" "ain't" "a"
# [4] "Hewlett-Packard" "box" ""
# [7] "it's" "an" "Apple"
# [10] "box," "a" "very"
# [13] "nice" "one!"
I am using regex lookaheads and lookbehinds. The pattern (?!\\w) is a negative lookahead: it tells the engine to remove every punctuation mark except those that are immediately followed by a word character (a letter, digit, or underscore). In the second pattern, (?<!\\w) is a negative lookbehind; combined with the lookahead, it means a punctuation mark is removed only when it is neither preceded nor followed by a word character. To help remember the difference, a lookbehind looks "back" at what comes before the current position, while a lookahead looks "ahead" at what comes next.
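The effect of each lookaround can be seen on a shorter string. This is only a small sketch; the input x and the variable names are made up for illustration.

```r
# Hypothetical input: a hyphenated word, a lone hyphen, and trailing punctuation
x <- "a-b - c!"

# Negative lookahead only: punctuation is removed unless a word character follows.
# The hyphen in "a-b" is kept; the lone "-" and the final "!" are removed.
ahead_only <- gsub("[[:punct:]](?!\\w)", "", x, perl = TRUE)

# Lookbehind plus lookahead: punctuation is removed only when no word character
# sits on either side, so the "!" after "c" is kept as well.
both_sides <- gsub("(?<!\\w)[[:punct:]](?!\\w)", "", x, perl = TRUE)

print(ahead_only)
print(both_sides)
```

In both results the removed lone hyphen leaves two consecutive spaces, which is why splitting on a single space yields an empty string in the outputs above.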
Upvotes: 3