ssan
ssan

Reputation: 301

Extract matched keyword for the text

Looking for some help on extracting the keywords from the text. I have two data frame. The first dataframe has description column and the other data frame has only one column with the keywords.

I want to search the keywords from the dataframe2 on the description field and create a new column in dataframe1 with the matched keywords. If there are multiple keywords I need the newly added column with all the keywords seperated with comma as mentioned below.

Dataframe2

Keywords
New
FUND
EVENT 
Author
book

Dataframe1

ID  NAME    Month   DESCRIPTION              Keywords
12  x1       Jan    funding recived            fund
23  x2       Feb    author of the book     author, book
14  x3       Mar    new year event         new, event

Also, I need the keywords even the description has complete word. I.e Funding where I can get the keyword fund in the new column.

Upvotes: 2

Views: 371

Answers (2)

MKR
MKR

Reputation: 20095

A solution can be to use stringr::str_detect to check presence of Keywords in each DESCRIPTION.

library(stringr)

df1$Keywords <- mapply(function(x)paste(df2$Keywords[str_detect(x, tolower(df2$Keywords))],
                                        collapse = ","), df1$DESCRIPTION)

df1
#   ID NAME Month        DESCRIPTION    Keywords
# 1 12   x1   Jan    funding recived        FUND
# 2 23   x2   Feb author of the book Author,book
# 3 14   x3   Mar     new year event   New,EVENT

Data:

df1 <- read.table(text = 
"ID  NAME    Month   DESCRIPTION      
12  x1       Jan    'funding recived'   
23  x2       Feb    'author of the book'
14  x3       Mar    'new year event'",
header = TRUE, stringsAsFactors = FALSE)

df2 <- read.table(text = 
"Keywords
New
FUND
EVENT 
Author
book",
header = TRUE, stringsAsFactors = FALSE)

Upvotes: 1

akrun
akrun

Reputation: 887241

We can use regex_left_join from fuzzyjoin and do a group_by concatenation (paste)

library(fuzzyjoin)
library(dplyr)
df1 %>% 
   regex_left_join(df2, by = c('DESCRIPTION' = 'Keywords'), 
              ignore_case = TRUE) %>% 
   group_by(ID, NAME, Month, DESCRIPTION) %>% 
   summarise(Keywords = toString(unique(tolower(Keywords))))
# A tibble: 3 x 5
# Groups:   ID, NAME, Month [?]
#     ID NAME  Month DESCRIPTION        Keywords    
#  <int> <chr> <chr> <chr>              <chr>       
#1    12 x1    Jan   funding recived    fund        
#2    14 x3    Mar   new year event     new, event  
#3    23 x2    Feb   author of the book author, book

data

df1 <- structure(list(ID = c(12L, 23L, 14L), NAME = c("x1", "x2", "x3"
), Month = c("Jan", "Feb", "Mar"), DESCRIPTION = c("funding recived", 
"author of the book", "new year event")), .Names = c("ID", "NAME", 
"Month", "DESCRIPTION"), class = "data.frame", row.names = c(NA, 
-3L))

df2 <- structure(list(Keywords = c("New", "FUND", "EVENT", "Author", 
"book")), .Names = "Keywords", class = "data.frame", row.names = c(NA, 
-5L))

Upvotes: 4

Related Questions