Reputation: 745
I have this data frame
df <- structure(list(ID = 1:3, Text = c("there was not clostridium", "clostridium difficile positive", "test was OK but there was clostridium")), class = "data.frame", row.names = c(NA, -3L))
ID Text
1 1 there was not clostridium
2 2 clostridium difficile positive
3 3 test was OK but there was clostridium
And a pattern of stop words
stop <- paste0(c("was", "but", "there"), collapse = "|")
I would like to go through the Text for each ID and replace the words that match the stop pattern. It is important to keep the order of the words. I do not want to use merge functions.
I have tried this
df$Words <- tokenizers::tokenize_words(df$Text, lowercase = TRUE) ##I would like to make a list of single words
for (i in length(df$Words)){
df$clean <- lapply(df$Words, function(y) lapply(1:length(df$Words[i]),
function(x) stringr::str_replace(unlist(y) == x, stop, "REPLACED")))
}
But this gives me vectors of "FALSE" strings, not a list of words.
> df
ID Text Words clean
1 1 there was not clostridium there, was, not, clostridium FALSE, FALSE, FALSE, FALSE
2 2 clostridium difficile positive clostridium, difficile, positive FALSE, FALSE, FALSE
3 3 test was OK but there was clostridium test, was, ok, but, there, was, clostridium FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE
I would like to get this (replace all words from stop pattern and keep word order)
> df
ID Text Words clean
1 1 there was not clostridium there, was, not, clostridium "REPLACED", "REPLACED", not, clostridium
2 2 clostridium difficile positive clostridium, difficile, positive clostridium, difficile, positive
3 3 test was OK but there was clostridium test, was, ok, but, there, was, clostridium test, "REPLACED", OK, "REPLACED", "REPLACED", "REPLACED", clostridium
Upvotes: 0
Views: 450
Reputation: 336
You could use rflashtext:
df <- structure(list(ID = 1:3, Text = c("there was not clostridium", "clostridium difficile positive", "test was OK but there was clostridium")), class = "data.frame", row.names = c(NA, -3L))
library(rflashtext)
processor <- KeywordProcessor$new(keys = c("was", "but", "there"),
words = rep.int("REPLACED", 3L))
df$Clean <- strsplit(processor$replace_keys(df$Text), split = " ", fixed = TRUE)
df
ID Text
1 1 there was not clostridium
2 2 clostridium difficile positive
3 3 test was OK but there was clostridium
Clean
1 REPLACED, REPLACED, not, clostridium
2 clostridium, difficile, positive
3 test, REPLACED, OK, REPLACED, REPLACED, REPLACED, clostridium
Upvotes: 1
Reputation: 136
Tidyverse solution:
First, you need to modify the stop vector so that each stop word has \b before and after it. \b matches a word boundary, which avoids accidentally removing the pattern from within other words.
library(stringr)
library(dplyr)
stop <- paste0(c("\\bwas\\b", "\\bbut\\b", "\\bthere\\b"), collapse = "|")
Then remove the stop words with str_remove_all. This leaves runs of extra whitespace (and a leading space when the first word is removed), which str_squish cleans up by trimming both ends and collapsing repeated whitespace into a single space.
df %>% mutate(Words = str_remove_all(Text, stop)) %>%
mutate(Words = str_squish(Words))
This yields the following result (I added an "I was bit by a wasp" row to check that "wasp" is left intact):
# A tibble: 4 x 3
ID Text Words
<int> <chr> <chr>
1 1 there was not clostridium not clostridium
2 2 clostridium difficile positive clostridium difficile positive
3 3 test was OK but there was clostridium test OK clostridium
4 4 I was bit by a wasp I bit by a wasp
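To see what the word boundaries buy you, here is a small base R sketch comparing a plain alternation with a \b-wrapped one. The sample sentence and the variable names (plain, bounded, and so on) are my own, chosen only for illustration:

```r
# Hypothetical sample sentence: "wasp" contains the stop word "was"
x <- "I was bit by a wasp"

# Plain alternation: matches the stop words anywhere, even inside other words
plain <- "was|but|there"

# Same words wrapped in word boundaries: only whole words match
bounded <- "\\b(was|but|there)\\b"

plain_result <- gsub(plain, "REPLACED", x, perl = TRUE)
bounded_result <- gsub(bounded, "REPLACED", x, perl = TRUE)

print(plain_result)
print(bounded_result)
```

With the plain pattern, "wasp" turns into "REPLACEDp". With the bounded pattern, only the standalone "was" is replaced.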
Upvotes: 1
Reputation: 456
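You can use data.table for it:
library(data.table)
# Words is the list column created with tokenizers::tokenize_words in the question
df = as.data.table(df)[, clean := lapply(Words, function(x) gsub(stop, "REPLACED", x))]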
Or you can use base R (without creating the Words column):
df$clean = lapply(strsplit(df$Text, " "), function(x) gsub(stop, "REPLACED", x))
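If partial matches are a concern (gsub with a plain "was|but|there" pattern would also hit "was" inside "wasp"), one option is to compare whole tokens against the stop words instead of using a regex. This is a minimal sketch, assuming words are separated by single spaces and that matching should be case-insensitive. The stop_words vector and the tolower() step are my assumptions, not something from the question:

```r
df <- data.frame(
  ID = 1:3,
  Text = c("there was not clostridium",
           "clostridium difficile positive",
           "test was OK but there was clostridium")
)

# Assumed: stop words kept as a plain character vector rather than a regex
stop_words <- c("was", "but", "there")

# Split each text on single spaces, then replace exact stop-word tokens.
# Word order is kept because each token is handled in place.
tokens <- strsplit(df$Text, " ", fixed = TRUE)
df$clean <- lapply(tokens, function(w) {
  ifelse(tolower(w) %in% stop_words, "REPLACED", w)
})

print(df$clean)
```

Because only whole tokens are compared, a word such as "wasp" would be left untouched, and the original casing of the remaining words (for example "OK") is preserved.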
Upvotes: 1