mdb_ftl

Reputation: 443

How to extract matches from stringr::str_detect in R into a list vector

I am trying to perform the following search on a database of text.

Here is the sample database, df

df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  name = c("john doe", "carol jones", "jimmy smith",
           "jenny ruiz", "joey jones", "tim brown"),
  place = c("reno nevada", "poland maine", "warsaw poland",
            "trenton new jersey", "brooklyn new york", "atlanta georgia")
)

I have a vector of strings which contains terms I am trying to find.

new_search <- c("poland", "jones")

I pass the vector to str_detect to find ANY of the strings in new_search in ANY of the columns in df, and then return the rows that match:

library(dplyr)
library(stringr)

# Keep rows where any column matches any of the search terms
df %>%
  filter_all(any_vars(str_detect(., paste(new_search, collapse = "|"))))
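For reference, the same row filter can be sketched with if_any, since the scoped verbs such as filter_all are superseded in recent dplyr releases. This is only a sketch: it assumes dplyr 1.0.4 or later, restricts the search to the two text columns, and the names pat and result are introduced here for illustration.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  name = c("john doe", "carol jones", "jimmy smith",
           "jenny ruiz", "joey jones", "tim brown"),
  place = c("reno nevada", "poland maine", "warsaw poland",
            "trenton new jersey", "brooklyn new york", "atlanta georgia")
)
new_search <- c("poland", "jones")

# One alternation pattern: matches if ANY search term occurs
pat <- paste(new_search, collapse = "|")

# Keep rows where name OR place contains a match
# (result is an illustrative name, not from the original post)
result <- df %>%
  filter(if_any(c(name, place), ~ str_detect(.x, pat)))

print(result)
```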

Question: how can I extract the results of str_detect into a new column?
For each row that is returned, I would like to collect the terms that were successfully matched and put them in a list or character vector (matched_terms), something like this:

  id        name             place    matched_terms   
1  2 carol jones      poland maine   c("jones", "poland")
2  3 jimmy smith     warsaw poland   c("poland")
3  5  joey jones brooklyn new york   c("jones")

Upvotes: 1

Views: 1198

Answers (2)

Zhiqiang Wang

Reputation: 6769

This is my naive solution:

new_search <- c("poland", "jones") %>% paste(collapse = "|")
df %>% 
  mutate(new_var = str_extract_all(paste(name, place), new_search))
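The snippet above keeps every row and stores the matches as a list column. A minimal sketch of how it might be extended to return only the matching rows, under the assumption that duplicate hits within a row should be collapsed, is shown here. The names search_terms, pat, matched_terms and result are illustrative choices; the search vector is kept separate from the pattern so it is not overwritten.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  name = c("john doe", "carol jones", "jimmy smith",
           "jenny ruiz", "joey jones", "tim brown"),
  place = c("reno nevada", "poland maine", "warsaw poland",
            "trenton new jersey", "brooklyn new york", "atlanta georgia")
)

search_terms <- c("poland", "jones")
pat <- paste(search_terms, collapse = "|")

result <- df %>%
  # str_extract_all returns one character vector per row (a list column)
  mutate(matched_terms = str_extract_all(paste(name, place), pat)) %>%
  # Assumption: a term matched twice in one row should be listed once
  mutate(matched_terms = lapply(matched_terms, unique)) %>%
  # Rows with no match hold character(0), which has length 0
  filter(lengths(matched_terms) > 0)

print(result)
```

Terms appear in the order they occur in the pasted text, so name matches come before place matches.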

Upvotes: 5

Ronak Shah

Reputation: 388982

You can extract every match from multiple columns with str_extract_all, then combine the results into one column with unite. Because unite collapses the columns into a single string, empty results become the literal text "character(0)"; we remove that with str_remove_all and keep only the rows that have at least one matched term.

library(tidyverse)

pat <- str_c(new_search, collapse = "|")

df %>%
  mutate(across(-id, ~str_extract_all(., pat), .names = '{col}_new')) %>% 
  unite(matched_terms, ends_with('new'), sep = ',') %>%
  mutate(matched_terms = str_remove_all(matched_terms, 
                         'character\\(0\\),?|,character\\(0\\)')) %>%
  filter(matched_terms != '')

#  id        name             place matched_terms
#1  2 carol jones      poland maine  jones,poland
#2  3 jimmy smith     warsaw poland        poland
#3  5  joey jones brooklyn new york         jones
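If the "character(0)" cleanup feels fragile, one possible variation is to combine the per-column matches as lists and build the string only at the end. This is a sketch rather than a replacement for the code above; it assumes only name and place need searching, and matched_string and result are names made up for the example.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  name = c("john doe", "carol jones", "jimmy smith",
           "jenny ruiz", "joey jones", "tim brown"),
  place = c("reno nevada", "poland maine", "warsaw poland",
            "trenton new jersey", "brooklyn new york", "atlanta georgia")
)
new_search <- c("poland", "jones")
pat <- str_c(new_search, collapse = "|")

result <- df %>%
  # Concatenate the matches from each column row by row;
  # c(character(0), character(0)) stays empty, so no text cleanup is needed
  mutate(matched_terms = Map(c,
                             str_extract_all(name, pat),
                             str_extract_all(place, pat))) %>%
  filter(lengths(matched_terms) > 0) %>%
  # Optional: a comma-separated string like the output shown above
  mutate(matched_string = vapply(matched_terms, paste, character(1),
                                 collapse = ","))

print(result)
```

This keeps matched_terms as a list column, which is closer to what the question asks for, while matched_string mirrors the comma-separated output.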

Upvotes: 2
