mellymoo

Reputation: 77

Create a new column with the words of a list that can be found in a phrase in R

I have two dataframes - one contains a column of sentences/phrases and the other contains a list of Tag Words. I want to create a new column that displays the tag words that show up in that sentence/phrase.

Sentence <- c(1,2,3)
Description <- c("I like potatoes, tomatoes, and broccoli", "Carrots, Radishes, and Potatoes", "Thanksgiving is my favorite because of Turkey")
df <- data.frame(Sentence, Description)


Names <- c("Potatoes", "Tomatoes", "Broccoli", "Turkey", "Thanksgiving")
Freq <- c("67", "13", "12", "10", "10")
List <- data.frame(Names, Freq)


# Desired output
df$Tags <- c("Potatoes, Tomatoes, Broccoli", "Potatoes", "Turkey, Thanksgiving")
df

Upvotes: 1

Views: 200

Answers (2)

dylanjm

Reputation: 2101

You can use the tidyverse, specifically the stringr package, to extract every word from the Names vector that appears in each sentence. There is probably a cleaner way to do this, but this will answer your question:

library(tidyverse)

Sentence <- c(1,2,3)
Description <- c("I like potatoes, tomatoes, and broccoli", "Carrots, Radishes, and Potatoes", "Thanksgiving is my favorite because of Turkey")
df <- data.frame(Sentence, Description)


Names <- c("Potatoes", "Tomatoes", "Broccoli", "Turkey", "Thanksgiving")

df %>% 
  mutate(tags = str_extract_all(str_to_lower(Description), 
                                glue::glue_collapse(str_to_lower(Names), sep = "|")))
#>   Sentence                                   Description
#> 1        1       I like potatoes, tomatoes, and broccoli
#> 2        2               Carrots, Radishes, and Potatoes
#> 3        3 Thanksgiving is my favorite because of Turkey
#>                           tags
#> 1 potatoes, tomatoes, broccoli
#> 2                     potatoes
#> 3         thanksgiving, turkey

Created on 2019-04-29 by the reprex package (v0.2.1)
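
Note that the tags column above is a list column of lower-cased matches. If you want a single comma-separated string that uses the casing from Names, as in the question's desired output, one possible sketch follows. It is not part of the reprex above; tag_lookup and to_tags are names made up for this illustration. The pattern matches substrings rather than whole words, and it assumes the tag words contain no regex metacharacters.

```r
library(stringr)

Description <- c("I like potatoes, tomatoes, and broccoli",
                 "Carrots, Radishes, and Potatoes",
                 "Thanksgiving is my favorite because of Turkey")
Names <- c("Potatoes", "Tomatoes", "Broccoli", "Turkey", "Thanksgiving")

# Lookup from lower-cased tag to its original spelling (hypothetical helper)
tag_lookup <- setNames(Names, str_to_lower(Names))

# One alternation pattern: "potatoes|tomatoes|broccoli|turkey|thanksgiving"
pattern <- str_c(str_to_lower(Names), collapse = "|")

# List with one character vector of lower-cased matches per sentence
matches <- str_extract_all(str_to_lower(Description), pattern)

# Restore the original casing, drop repeats, collapse to one string per row
to_tags <- function(m) paste(unique(unname(tag_lookup[m])), collapse = ", ")
Tags <- vapply(matches, to_tags, character(1))
Tags
#> [1] "Potatoes, Tomatoes, Broccoli" "Potatoes"
#> [3] "Thanksgiving, Turkey"
```

Tags can then be assigned with df$Tags <- Tags. Matches are listed in the order they appear in each sentence.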

Upvotes: 1

olooney

Reputation: 2483

The following seems to work OK:

library(magrittr)

# Map each lower-cased tag word to its original spelling
word_hash <- new.env(hash = TRUE, parent = emptyenv())
for (word in List$Names) {
  word_hash[[tolower(word)]] <- word
}

df$Tags <- df$Description %>%
  tolower() %>%
  (function(s) gsub("[^ a-z]", "", s)) %>%  # keep only letters and spaces
  strsplit(" ") %>%
  sapply(function(words)
    # Look up each word; misses return NULL and are dropped by unlist()
    paste0(unique(unlist(sapply(words, function(key) word_hash[[key]]))), collapse = ", ")
  )

An environment object is used to get a fast O(1) hash table; without it, this would be very slow for large vocabularies.
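
As a small illustration of how the environment behaves as a lookup table (a sketch with a single made-up entry, not the full vocabulary): a key that is present returns its stored value, a missing key returns NULL, and unlist() silently drops the NULLs. That is why only matching words survive in the pipeline.

```r
word_hash <- new.env(hash = TRUE, parent = emptyenv())
word_hash[["turkey"]] <- "Turkey"

hit  <- word_hash[["turkey"]]    # "Turkey"
miss <- word_hash[["carrots"]]   # NULL, because the key was never stored

# NULL entries disappear when the list is flattened
kept <- unlist(list(miss, hit))
kept
#> [1] "Turkey"
```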

The gsub() line assumes that all your words will consist entirely of normal letters a-z with no punctuation or digits. You may need to adjust that line if some words contain other characters.
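
For example, if some tag words contained digits or hyphens, the character class could be widened along these lines (a sketch on a made-up string; which characters to keep depends on your data):

```r
s <- "gluten-free pasta, 2 eggs!"

# Keep spaces, lowercase letters, digits, and hyphens; drop everything else.
# The hyphen goes last inside the brackets so it is treated literally.
cleaned <- gsub("[^ a-z0-9-]", "", tolower(s))
cleaned
#> [1] "gluten-free pasta 2 eggs"
```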

Likewise, the strsplit(" ") assumes that all of your words are separated by single spaces, which is true for your test case. If they will sometimes be separated by tabs, newlines, or other characters, you'll have to modify that a bit.
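
One way to handle that, sketched on a made-up string, is to split on runs of any whitespace with a POSIX character class instead of a single space:

```r
s <- "carrots\tradishes  and\npotatoes"

# Split on one or more whitespace characters (spaces, tabs, newlines)
words <- strsplit(s, "[[:space:]]+")[[1]]
words
#> [1] "carrots"  "radishes" "and"      "potatoes"
```

In the pipeline above this would mean replacing strsplit(" ") with strsplit("[[:space:]]+"). Leading whitespace would still produce an empty first element, so you may want to call trimws() first.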

Doing a case-insensitive match while also keeping track of correct casing complicates the solution but is implicit in the test case you've written. If you don't care about that, you can simplify a bit.
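
If lower-cased tags are acceptable, a simplified sketch could drop the lookup table and use intersect() instead. This is an alternative to the code above, not the same method, and vocab and words_per_row are names made up here:

```r
Description <- c("I like potatoes, tomatoes, and broccoli",
                 "Carrots, Radishes, and Potatoes",
                 "Thanksgiving is my favorite because of Turkey")
Names <- c("Potatoes", "Tomatoes", "Broccoli", "Turkey", "Thanksgiving")

vocab <- tolower(Names)

# Same cleaning and splitting steps as above
words_per_row <- strsplit(gsub("[^ a-z]", "", tolower(Description)), " ")

# intersect() keeps the unique words of each sentence that are in vocab,
# in the order they appear in the sentence
Tags <- vapply(words_per_row,
               function(words) paste(intersect(words, vocab), collapse = ", "),
               character(1))
Tags
#> [1] "potatoes, tomatoes, broccoli" "potatoes"
#> [3] "thanksgiving, turkey"
```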

This solution returns unique words in the order they were found in the original sentence. This appears to be closest to what you had in mind, although your last test case is in a different order. You could also consider wrapping the unique() in a sort() if you want tags to be in a consistent order.
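
A minimal sketch of that last suggestion, using a made-up vector of found words in place of the lookup result:

```r
found <- c("Thanksgiving", "Turkey", "Thanksgiving", "Potatoes")

# Deduplicate first, then sort alphabetically before collapsing
tags <- paste0(sort(unique(found)), collapse = ", ")
tags
#> [1] "Potatoes, Thanksgiving, Turkey"
```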

Upvotes: 0
