R Ban

Reputation: 97

Text Processing in R finding words

I am looking for an efficient way to code the check below. Any text that contains both add and onion should be marked Found; otherwise it should be marked Not Found. I don't want to hard-code every combination of words. My current attempt, which hard-codes the phrases, is:

word_check <- c("add get onion" ,
                 "add to onion",
                "add oil to onion",
                "add oils to onion" ,
                "add salt to onion" ,
                "add get onion" ,
                "add get onion", 
                "add get onion")

df <- as.data.frame(c("I can add get onion" ,
                      "we can add to onion",
                      "I love to add oil to onion",
                      "I may not add oils to onion" ,
                      "add salt to onion" ,
                      "add get onion" ,
                      "abc",
                      "def" ,
                      "ghi",
                      "jkl",
                      "add get onion", 
                      "add get onion","add oil to the vegetable", "add onion to the vegetable" ))
names(df)[1] <- "text"


pattern_word_check <- paste(word_check, collapse = "|")


library(stringr)  # provides str_detect() and regex()
df$New <- ifelse(str_detect(df$text, regex(pattern_word_check)),"Found","Not Found")

Regards, R

Upvotes: 1

Views: 48

Answers (3)

Benjamin Schwetz

Reputation: 643

Here is a solution using tidytext. For your concrete example this may seem like overkill, but in my opinion using higher-level functions such as a tokenizer together with an inner_join makes the code clearer and easier to build on.

df <- as.data.frame(c("I can add get onion" ,
                      "we can add to onion",
                      "I love to add oil to onion",
                      "I may not add oils to onion" ,
                      "add salt to onion" ,
                      "add get onion" ,
                      "abc",
                      "def" ,
                      "ghi",
                      "jkl",
                      "add get onion", 
                      "add get onion","add oil to the vegetable", "add onion to the vegetable" ), stringsAsFactors = FALSE)
names(df)[1] <- "text"
library(dplyr)
library(tidytext)
df_words <- df %>% 
  unnest_tokens(output = word,
                input = text,
                token = "words",
                drop = FALSE)
inner_join(
  df_words %>% filter(word == "add"),
  df_words %>% filter(word == "onion"),
  by = "text"
) %>% 
  select(text) %>% 
  distinct()
#>                          text
#> 1         I can add get onion
#> 2         we can add to onion
#> 3  I love to add oil to onion
#> 4 I may not add oils to onion
#> 5           add salt to onion
#> 6               add get onion
#> 7  add onion to the vegetable

Created on 2020-04-02 by the reprex package (v0.3.0)
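For comparison, the same token-based idea can be sketched in base R without any packages. This is only an illustration under my own assumptions: the names `required` and `has_all_words` are made up for this example, and splitting on non-alphanumeric characters is a much cruder tokenizer than the one `unnest_tokens` uses.

```r
# Sketch: whole-word check in base R (no packages needed).
# `required` and `has_all_words` are hypothetical names, not part of tidytext.
required <- c("add", "onion")

has_all_words <- function(text, required) {
  # Lower-case, then split on runs of non-alphanumeric characters
  tokens <- strsplit(tolower(text), "[^[:alnum:]]+")
  # TRUE when every required word appears as a whole token
  vapply(tokens, function(tok) all(required %in% tok), logical(1))
}

text <- c("I can add get onion", "abc",
          "add oil to the vegetable", "add onion to the vegetable")
found <- has_all_words(text, required)

data.frame(text, New = ifelse(found, "Found", "Not Found"))
```

Because the comparison is on whole tokens, a text such as "additional onions" would not count as a match, which may or may not be what you want.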

Upvotes: 0

Ronak Shah

Reputation: 389325

Since you only want to check for "onion" and "add", which can occur in any order, you could do:

df$New <- ifelse(grepl('.*add.*onion.*|.*onion.*add.*',df$text), "found", "not found")
# Faster option without ifelse: index into a two-element lookup vector
# df$New <- c('not found', 'found')[grepl('.*add.*onion.*|.*onion.*add.*', df$text) + 1]
df

#                          text       New
#1          I can add get onion     found
#2          we can add to onion     found
#3   I love to add oil to onion     found
#4  I may not add oils to onion     found
#5            add salt to onion     found
#6                add get onion     found
#7                          abc not found
#8                          def not found
#9                          ghi not found
#10                         jkl not found
#11               add get onion     found
#12               add get onion     found
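If the list of required words grows, writing out every ordering in one alternation gets unwieldy. A possible sketch, assuming a hypothetical vector `required` and using one `grepl()` call per word combined with `&`, could look like this. The `\\b` word boundaries are my addition and are stricter than the pattern above, which also matches substrings such as "addition".

```r
# Sketch: require every word in `required`, in any order.
# `required`, `hits` and `found` are illustrative names.
required <- c("add", "onion")

text <- c("we can add to onion",
          "onion, then add salt",
          "add oil to the vegetable",
          "addition of onions")

# One logical vector per required word; \\b restricts matches to whole words
hits <- lapply(required, function(w) {
  grepl(paste0("\\b", w, "\\b"), text, perl = TRUE)
})

# Element-wise AND across all words
found <- Reduce(`&`, hits)

data.frame(text, New = c("not found", "found")[found + 1])
```

Dropping the `\\b` parts gives the same substring behaviour as the original pattern.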

Upvotes: 0

linog

Reputation: 6226

Maybe I misunderstood, so I propose one solution based on your pattern_word_check variable and another using only onion and add in the regex.

Anyway, I think you are looking for grepl. You have many ways to solve your problem.

data.table

A data.table solution, using conditional replacement, would be:

library(data.table)
setDT(df)
df[,'new' := "Not Found"]
df[grepl(pattern_word_check, text), new := "Found"]

If you only want to consider texts that contain "onion" OR "add":

df[,'new' := "Not Found"]
df[grepl("(onion|add)", text), new := "Found"]

dplyr

A dplyr solution would be:

library(dplyr)
df %>% mutate(new = if_else(grepl(pattern_word_check, text), "Found", "Not Found"))

Note that I use if_else from the dplyr package, not base ifelse.

If you only want to consider texts that contain "onion" OR "add":

library(dplyr)
df %>% mutate(new = if_else(grepl("(onion|add)", text), "Found", "Not Found"))
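Note that "(onion|add)" matches a text containing either word. If both words must be present, as the question seems to ask, one option is to combine two grepl calls with &. Here is a small base R sketch on a made-up vector (the names `either` and `both` are just for illustration):

```r
# Sketch: contrast "either word" with "both words" on a toy vector.
text <- c("add onion to the vegetable",
          "add oil to the vegetable",
          "abc")

# Alternation: TRUE if at least one of the words occurs
either <- grepl("(onion|add)", text)

# Two separate checks combined with &: TRUE only if both words occur
both <- grepl("add", text, fixed = TRUE) & grepl("onion", text, fixed = TRUE)

data.frame(text,
           either = ifelse(either, "Found", "Not Found"),
           both   = ifelse(both, "Found", "Not Found"))
```

The same logical expression can be used as the condition inside the data.table or dplyr calls above.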

Upvotes: 1
