Reputation: 128

Regex expression to exclude hyphenated words in R

I am using R to tokenize a set of texts; after tokenization I end up with a character vector in which punctuation marks, apostrophes and hyphens are preserved. For instance, I have this original text:

txt <- "this ain't a Hewlett-Packard box - it's an Apple box, a very nice one!"

After the tokenization (which I perform using scan_tokenizer from package tm) I get the following char vector

   > vec1
 [1] "this"            "ain't"           "a"               "Hewlett-Packard"
 [5] "box"             "-"               "it's"            "an"             
 [9] "Apple"           "box,"            "a"               "very"           
[13] "nice"            "one!"

Now in order to get rid of the punctuation marks I do the following

vec2 <- gsub("[^[:alnum:][:space:]']", "", vec1)

That is, I replace everything that is not an alphanumeric character, a space or an apostrophe with ""; however, this is the result:

> vec2
 [1] "this"           "ain't"          "a"              "HewlettPackard" "box"           
 [6] ""               "it's"           "an"             "Apple"          "box"           
[11] "a"              "very"           "nice"           "one"

I want to preserve hyphenated words such as "Hewlett-Packard" while getting rid of lone hyphens. Basically, I need a regex that excludes hyphenated words of the form \w-\w from the gsub expression used for vec2.

Your suggestions are very welcome.

Upvotes: 4

Answers (6)

Ken Benoit

Reputation: 14902

I suggest two approaches: first, keep it as simple as possible; and second, use Unicode character classes whenever possible, especially for things like hyphens, for which various text processors may substitute other characters (see for instance http://www.fileformat.info/info/unicode/category/Pd/list.htm).

So:

Simplest (and also very fast), a binary match to detect only the hyphens:

vec1[!(vec1 %in% "-")]

Better (from a Unicode standpoint), also pretty fast:

vec1[!stringi::stri_detect_regex(vec1, "^\\p{Pd}$")]

The last one uses the Unicode character class Pd, representing "a dash or hyphen punctuation mark". This includes non-breaking hyphens, em dashes, etc. The ^ and $ at the beginning and end of the regular expression anchor the match, so only tokens consisting of a single such character are detected.
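To illustrate the difference between the two approaches, here is a minimal sketch in base R. The token vector is made up for illustration, and it swaps stringi for grepl(..., perl = TRUE), which assumes that R's PCRE library was built with Unicode property support (usually the case, but not guaranteed).

```r
# Hypothetical tokens containing an ASCII hyphen, an en dash (U+2013)
# and an em dash (U+2014) as standalone elements
tokens <- c("box", "-", "it's", "\u2013", "Apple", "\u2014", "Hewlett-Packard")

# Exact matching only removes the ASCII hyphen-minus
ascii_only <- tokens[!(tokens %in% "-")]

# \p{Pd} removes any token that is a single dash-punctuation character
# (assumes PCRE with Unicode property support)
any_dash <- tokens[!grepl("^\\p{Pd}$", tokens, perl = TRUE)]

print(ascii_only)
print(any_dash)
```

In both cases "Hewlett-Packard" survives, because the pattern has to match the whole token.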

Upvotes: 1

IRTFM

Reputation: 263362

If you just want to remove "pure hyphens", use the pattern '^-$' (the hyphen is not a regex metacharacter outside a bracket expression, so it needs no escaping):

vec2 <- vec1[!grepl( '^-$' , vec1) ]

If you wanted to remove "naked punctuation of all sorts" it might be:

vec2 <- vec1[!grepl( '^[[:punct:]]$' , vec1) ]
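To get all the way to the output the question asks for, the filter can be combined with the asker's gsub call, extended so that hyphens are kept as well. This is only a sketch; vec1 is rebuilt by hand from the question, and the name tokens is introduced here for illustration.

```r
# Tokens as shown in the question
vec1 <- c("this", "ain't", "a", "Hewlett-Packard", "box", "-", "it's",
          "an", "Apple", "box,", "a", "very", "nice", "one!")

# Step 1: drop tokens that consist of a single punctuation character
tokens <- vec1[!grepl("^[[:punct:]]$", vec1)]

# Step 2: strip the remaining punctuation but keep apostrophes and hyphens
# (the hyphen comes last in the bracket expression, so it is literal)
vec2 <- gsub("[^[:alnum:][:space:]'-]", "", tokens)

print(vec2)
```

Doing the filtering first means the substitution never sees a lone hyphen, so no empty strings are left behind.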

Upvotes: 5

Jota

Reputation: 17611

Here's an approach using strsplit with word boundaries (\b) and non-word characters (\W, which is equivalent to [^[:alnum:]_]):

strsplit(txt, "\\b | \\b|\\W |\\W$")
#[[1]]
# [1] "this"            "ain't"           "a"               "Hewlett-Packard"
# [5] "box"             ""                "it's"            "an"             
# [9] "Apple"           "box"             "a"               "very"           
#[13] "nice"            "one"

Or, to return nothing at all for the lone hyphen instead of "":

strsplit(txt, "\\b | \\b| ?\\W |\\W$")
#[[1]]
# [1] "this"            "ain't"           "a"               "Hewlett-Packard"
# [5] "box"             "it's"            "an"              "Apple"          
# [9] "box"             "a"               "very"            "nice"
#[13] "one"

Upvotes: 2

Avinash Raj

Reputation: 174706

You may try this,

> library(stringr)    
> txt <- "this ain't a Hewlett-Packard box - it's an Apple box, a very nice one!"
> gsub("(?!\\b['-]\\b|\\s)[\\W_]", "", str_extract_all(txt, "\\S+")[[1]], perl=T)
 [1] "this"            "ain't"           "a"              
 [4] "Hewlett-Packard" "box"             ""               
 [7] "it's"            "an"              "Apple"          
[10] "box"             "a"               "very"           
[13] "nice"            "one"

> strsplit(gsub('(?!\\b[[:punct:]]\\b|\\s)[\\W_]', '', txt,perl=T), ' ')[[1]]
 [1] "this"            "ain't"           "a"              
 [4] "Hewlett-Packard" "box"             ""               
 [7] "it's"            "an"              "Apple"          
[10] "box"             "a"               "very"           
[13] "nice"            "one"

Upvotes: 2

Shenglin Chen

Reputation: 4554

strsplit(gsub("[^[:alnum:][:space:]'-]", "", txt), '\\s|\\ - ')

Upvotes: 2

Pierre L

Reputation: 28441

strsplit(gsub('[[:punct:]](?!\\w)', '', txt, perl=T), ' ')[[1]]
 #[1] "this"            "ain't"           "a"              
 #[4] "Hewlett-Packard" "box"             ""               
 #[7] "it's"            "an"              "Apple"          
#[10] "box"             "a"               "very"           
#[13] "nice"            "one"

Or you can do this to keep the exclamation point after "one":

strsplit(gsub('(?<!\\w)[[:punct:]](?!\\w)', '', txt,perl=T), ' ')[[1]]
#  [1] "this"            "ain't"           "a"              
#  [4] "Hewlett-Packard" "box"             ""               
#  [7] "it's"            "an"              "Apple"          
# [10] "box,"            "a"               "very"           
# [13] "nice"            "one!"

I am using regex lookbehinds and lookaheads. In the first pattern, (?!\\w) is a negative lookahead: it tells the engine to remove every punctuation mark except those that are followed by a word character (a letter, digit or underscore). In the second pattern, (?<!\\w) is a negative lookbehind: it additionally spares punctuation marks that come directly after a word character, so only punctuation with no word character on either side is removed. To help remember the difference, a lookbehind looks "back" at what comes before the current position, while a lookahead looks "ahead" at what comes next.
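A small sketch may make the two assertions easier to tell apart. The test string x and the result names are made up for illustration; each hyphen in x has a different combination of neighbours.

```r
# Hyphens with a word character on both sides, only after, only before, and on neither side
x <- "a-b -c d- - e"

# Negative lookahead: remove punctuation that is NOT followed by a word character
ahead <- gsub("[[:punct:]](?!\\w)", "", x, perl = TRUE)

# Negative lookbehind: remove punctuation that is NOT preceded by a word character
behind <- gsub("(?<!\\w)[[:punct:]]", "", x, perl = TRUE)

# Both together: remove only punctuation with no word character on either side
both <- gsub("(?<!\\w)[[:punct:]](?!\\w)", "", x, perl = TRUE)

print(ahead)   # "a-b -c d  e"
print(behind)  # "a-b c d-  e"
print(both)    # "a-b -c d-  e"
```

Only the combined pattern leaves "-c" and "d-" alone while still removing the free-standing hyphen.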

Upvotes: 3
