fugu

Reputation: 6568

Remove words from string

I'm trying to remove certain words from a data frame:

name   age  message
James  34   hello, my name is James. 
John   30   hello, my name is John. Here is my favourite website https://stackoverflow.com
Jim    27   Hi! I'm another person whose name begins with a J! Here is something that should be filtered out: <filter>

df<-structure(list(name = structure(c(1L, 3L, 2L), .Label = c("James", 
"Jim", "John"), class = "factor"), age = c(34L, 30L, 27L), message = structure(1:3, .Label = c("hello, my name is James. ", 
"hello, my name is John. Here is my favourite website https://stackoverflow.com", 
"Hi! I'm another person whose name begins with a J! Here is something that should be filtered out: <filter>"
), class = "factor")), .Names = c("name", "age", "message"), class = "data.frame", row.names = c(NA, 
-3L))

I'm trying to remove all words containing matches to http or filter.

I would like to iterate over each row, split the string on white space and then ask whether the word contains either http or <filter> (or other words). If so, then I want to replace the word with a space.
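The split-and-filter idea described above can be sketched roughly as follows. This is only an illustration: `remove_matching_words` is a hypothetical helper name, the sample strings are shortened versions of the data, and the sketch drops matching words and rejoins the rest with single spaces (so runs of whitespace and trailing spaces are collapsed) rather than replacing each word with a space.

```r
# Hypothetical helper: split each string on whitespace and drop every word
# that matches any of the given patterns (treated as regular expressions).
remove_matching_words <- function(x, patterns = c("http", "<filter>")) {
  pattern <- paste(patterns, collapse = "|")
  vapply(strsplit(as.character(x), "\\s+"), function(words) {
    # Keep only the words that do NOT contain a match, then rejoin them
    paste(words[!grepl(pattern, words)], collapse = " ")
  }, character(1))
}

# Shortened sample messages (assumed, not the full data frame)
messages <- c("hello, my name is James. ",
              "Here is my favourite website https://stackoverflow.com",
              "should be filtered out: <filter>")

cleaned_messages <- remove_matching_words(messages)
cleaned_messages
```

`as.character()` is there because the `message` column in the `dput` output is a factor.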

There are a load of questions concerning removing words that exactly match another word, or list of words, but I can't find much on removing words that match some criteria (e.g. http or www.).

I've tried gsub, !grepl and tm_map approaches, but I can't get any of them to produce my expected output of:

name   age  message
James  34   hello, my name is James. 
John   30   hello, my name is John. Here is my favourite website 
Jim    27   Hi! I'm another person whose name begins with a J! Here is something that should be filtered out: 

Upvotes: 3

Views: 2994

Answers (2)

Wiktor Stribiżew

Reputation: 626738

To remove any non-whitespace chunk that contains http, https or <filter> (or other words) as a whole word, you may use gsub with the following PCRE regex (pass the perl=TRUE argument):

(?:\s+|^)\S*(?<!\w)(?:https?|<filter>)(?!\w)\S*

See the regex demo

Details

  • (?:\s+|^) - 1+ whitespaces or start of string
  • \S* - 0+ non-whitespace chars as many as possible
  • (?<!\w) - no word char allowed immediately to the left of the current location
  • (?:https?|<filter>) - http, https or <filter>
  • (?!\w) - no word char allowed immediately to the right of the current location (after the words in the alternation group)
  • \S* - 0+ non-whitespace chars as many as possible.
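
To see what the two lookarounds buy you, here is a small sketch on made-up inputs (the strings in `x` are assumptions chosen for illustration, not part of the question's data). A chunk is only removed when http, https or <filter> is not glued to a word character on either side:

```r
# Pattern copied from the regex above, with backslashes doubled for R
pattern <- "(?:\\s+|^)\\S*(?<!\\w)(?:https?|<filter>)(?!\\w)\\S*"

# Made-up inputs that exercise the lookarounds
x <- c("see httpx://a and http://b",          # "httpx" is followed by a word char
       "keep x<filter> but drop (<filter>)")  # "x<filter>" is preceded by a word char

cleaned <- gsub(pattern, "", x, perl = TRUE)
cleaned
# [1] "see httpx://a and"       "keep x<filter> but drop"
```

The whitespace before a removed chunk is consumed as well, which is why no trailing space is left behind.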

See an online R demo:

df<-structure(list(name = structure(c(1L, 3L, 2L), .Label = c("James", 
"Jim", "John"), class = "factor"), age = c(34L, 30L, 27L), message = structure(1:3, .Label = c("hello, my name is James. ", 
"hello, my name is John. Here is my favourite website https://stackoverflow.com", 
"Hi! I'm another persoon whose name begins with a J! Here is something that should be filtered out: <filter>"
), class = "factor")), .Names = c("name", "age", "message"), class = "data.frame", row.names = c(NA, 
-3L))
df$message <- gsub("(?:\\s+|^)\\S*(?<!\\w)(?:https?|<filter>)(?!\\w)\\S*", "", df$message, perl=TRUE)
df$message

Result:

[1] "hello, my name is James. "                                                                         
[2] "hello, my name is John. Here is my favourite website"                                              
[3] "Hi! I'm another persoon whose name begins with a J! Here is something that should be filtered out:"

Upvotes: 2

akrun

Reputation: 887028

We can use gsub to remove a whitespace character followed by either a URL or <filter>:

gsub("\\s(https?:\\S+|<filter>)", "", df$message)

Upvotes: 2
