Reputation: 6568
I'm trying to remove certain words from a data frame:
name age words
James 34 hello, my name is James.
John 30 hello, my name is John. Here is my favourite website https://stackoverflow.com
Jim 27 Hi! I'm another person whose name begins with a J! Here is something that should be filtered out: <filter>
df<-structure(list(name = structure(c(1L, 3L, 2L), .Label = c("James",
"Jim", "John"), class = "factor"), age = c(34L, 30L, 27L), message = structure(1:3, .Label = c("hello, my name is James. ",
"hello, my name is John. Here is my favourite website https://stackoverflow.com",
"Hi! I'm another person whose name begins with a J! Here is something that should be filtered out: <filter>"
), class = "factor")), .Names = c("name", "age", "message"), class = "data.frame", row.names = c(NA,
-3L))
I'm trying to remove all words containing matches to http or filter.
I would like to iterate over each row, split the string on white space and then ask whether the word contains either http or <filter> (or other words). If so, then I want to replace the word with a space.
There are a load of questions about removing words that exactly match another word, or a list of words, but I can't find much on removing words that match some criteria (e.g. containing http or www.).
I've tried gsub, !grepl and tm_map approaches (e.g. this), but I can't get any of them to produce my expected output of:
name age words
James 34 hello, my name is James.
John 30 hello, my name is John. Here is my favourite website
Jim 27 Hi! I'm another person whose name begins with a J! Here is something that should be filtered out:
Upvotes: 3
Views: 2994
Reputation: 626738
To remove any non-whitespace chunks containing either http or filter (or other words) as whole words, you may use gsub with the following PCRE regex (add the perl=TRUE argument):
(?:\s+|^)\S*(?<!\w)(?:https?|<filter>)(?!\w)\S*
See the regex demo
Details
(?:\s+|^) - 1+ whitespaces or start of string
\S* - 0+ non-whitespace chars, as many as possible
(?<!\w) - no word char allowed immediately to the left of the current location
(?:https?|<filter>) - http, https or <filter>
(?!\w) - no word char allowed immediately to the right of the current location (after the words in the alternation group)
\S* - 0+ non-whitespace chars, as many as possible.
See an online R demo:
df<-structure(list(name = structure(c(1L, 3L, 2L), .Label = c("James",
"Jim", "John"), class = "factor"), age = c(34L, 30L, 27L), message = structure(1:3, .Label = c("hello, my name is James. ",
"hello, my name is John. Here is my favourite website https://stackoverflow.com",
"Hi! I'm another persoon whose name begins with a J! Here is something that should be filtered out: <filter>"
), class = "factor")), .Names = c("name", "age", "message"), class = "data.frame", row.names = c(NA,
-3L))
df$message <- gsub("(?:\\s+|^)\\S*(?<!\\w)(?:https?|<filter>)(?!\\w)\\S*", "", df$message, perl=TRUE)
df$message
Result:
[1] "hello, my name is James. "
[2] "hello, my name is John. Here is my favourite website"
[3] "Hi! I'm another persoon whose name begins with a J! Here is something that should be filtered out:"
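For comparison, the split-on-whitespace approach described in the question can be sketched roughly as below. This is only an illustration, not part of the gsub solution above: drop_tokens is a hypothetical helper name, and the default pattern reuses the same lookarounds so that matching stays whole-word. Note one difference from gsub: runs of whitespace are collapsed to single spaces and trailing whitespace is dropped.

```r
# Hypothetical helper (not from the answer above): split each string on
# whitespace, drop tokens that match `pattern`, and paste the rest back.
drop_tokens <- function(x, pattern = "(?<!\\w)(?:https?|<filter>)(?!\\w)") {
  x <- as.character(x)               # factor columns become character
  tokens <- strsplit(x, "\\s+")      # one vector of tokens per element
  vapply(tokens, function(tok) {
    keep <- !grepl(pattern, tok, perl = TRUE)  # TRUE where no match is found
    paste(tok[keep], collapse = " ")
  }, character(1))
}

msg <- c(
  "hello, my name is James. ",
  "hello, my name is John. Here is my favourite website https://stackoverflow.com",
  "Here is something that should be filtered out: <filter>"
)
drop_tokens(msg)
```

With a data frame like the one above, this could be applied as df$message <- drop_tokens(df$message). Because of the lookarounds, a token such as httpd is kept, since http is followed by a word character there.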
Upvotes: 2