JBH

Reputation: 101

Removing stopwords from a character string in R

Suppose I have the following string:

str_input <- c("Mellanox,Asia, China, India, JAVA, United States, APIs")

I used the following gsub call, which removes my specific stopwords:

gsub(paste0("\\b(",paste(location_sw, collapse="|"),")\\b"), "", str_input)

where location_sw is my list of stopwords:

location_sw <- c('Rose', 'Java', 'JAVA', 'Mellanox', 'Microsoft', '144GiB', 'West',
                 'Amazon', 'Channel Asia', 'jClarity', 'APIs')

Running the gsub call above gives the following output:

",Asia, China, India, , United States, "

However, I would like the following result:

"Asia, China, India, United States"

I would also like to remove the commas left behind once the stopwords are removed. Any input would be really helpful.

Upvotes: 4

Views: 328

Answers (3)

NelsonGon

Reputation: 13319

A base option:

paste(lapply(strsplit(str_input,",|,\\s"), function(x) 
               x[!x %in% location_sw])[[1]],collapse=", ")
#> [1] "Asia, China, India, United States"
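The call above only returns the result for the first element because of the [[1]]. If str_input held several strings, the same idea could be applied per element. A minimal sketch, assuming a made-up two-element vector named inputs (not from the question):

```r
# Stopword list from the question
location_sw <- c('Rose', 'Java', 'JAVA', 'Mellanox', 'Microsoft', '144GiB', 'West',
                 'Amazon', 'Channel Asia', 'jClarity', 'APIs')

# Hypothetical input with more than one string
inputs <- c("Mellanox,Asia, China", "India, JAVA, United States, APIs")

# Split each string on commas (plus optional whitespace), drop the stopwords,
# and paste the remaining pieces back together, one result per input element
cleaned <- vapply(strsplit(inputs, ",\\s*"),
                  function(x) paste(x[!x %in% location_sw], collapse = ", "),
                  character(1))
cleaned
```

Here vapply with character(1) is used instead of lapply so that the result is a plain character vector of the same length as inputs.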

Upvotes: 1

Joris C.

Reputation: 6234

Another approach is to strsplit the string into a character vector and then take the setdiff with respect to location_sw:

out <- setdiff(strsplit(str_input, split = ",\\s*")[[1]], location_sw)
out
#> [1] "Asia"          "China"         "India"         "United States"

If necessary, we can paste it back into a single string:

paste(out, collapse = ", ")
#> [1] "Asia, China, India, United States"
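One caveat: setdiff treats its arguments as sets, so repeated entries in the input are collapsed into one. If duplicates should be kept, filtering with %in% is a possible alternative. A minimal sketch, where dup_input is a made-up string used only for illustration:

```r
# Stopword list from the question
location_sw <- c('Rose', 'Java', 'JAVA', 'Mellanox', 'Microsoft', '144GiB', 'West',
                 'Amazon', 'Channel Asia', 'jClarity', 'APIs')

# Hypothetical input in which "India" appears twice
dup_input <- "Mellanox,Asia, India, JAVA, India, APIs"
parts <- strsplit(dup_input, split = ",\\s*")[[1]]

# setdiff() returns unique values, so the second "India" is dropped
set_out <- setdiff(parts, location_sw)
paste(set_out, collapse = ", ")
#> [1] "Asia, India"

# Filtering with %in% keeps every non-stopword entry, duplicates included
keep_out <- parts[!parts %in% location_sw]
paste(keep_out, collapse = ", ")
#> [1] "Asia, India, India"
```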

Upvotes: 4

Wiktor Stribiżew

Reputation: 627129

You may use

str_input <- c("Mellanox,Asia, China, India, JAVA, United States, APIs")
rx <- paste0("(?:,\\s*)*\\b(?:",paste(location_sw, collapse="|"),")\\b")
trimws(gsub(rx, "", str_input), whitespace = "[\\s,]")
## => [1] "Asia, China, India, United States"

The (?:,\\s*)* part matches zero or more occurrences of a comma followed by zero or more whitespace characters, so each stopword is removed together with the separator in front of it.

The trimws call with whitespace = "[\\s,]" then removes any leading and trailing whitespace and commas.
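To see why both steps are needed, the two calls can be run separately. A small sketch using the data from the question; the names step1 and step2 are introduced here only for illustration:

```r
# Stopword list and input from the question
location_sw <- c('Rose', 'Java', 'JAVA', 'Mellanox', 'Microsoft', '144GiB', 'West',
                 'Amazon', 'Channel Asia', 'jClarity', 'APIs')
str_input <- "Mellanox,Asia, China, India, JAVA, United States, APIs"
rx <- paste0("(?:,\\s*)*\\b(?:", paste(location_sw, collapse = "|"), ")\\b")

# Step 1: remove each stopword along with any commas/whitespace in front of it.
# A stopword at the very start has no preceding comma, so its trailing comma stays.
step1 <- gsub(rx, "", str_input)
step1
#> [1] ",Asia, China, India, United States"

# Step 2: strip leftover leading/trailing commas and whitespace
# (the whitespace argument of trimws() needs R 3.6.0 or later)
step2 <- trimws(step1, whitespace = "[\\s,]")
step2
#> [1] "Asia, China, India, United States"
```

Note that the stopwords are pasted into the pattern as-is, so this assumes none of them contains regex metacharacters.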

Upvotes: 3
