John-Henry
John-Henry

Reputation: 1807

str_detect on multiple columns in the same row

I have two datasets, one with full names and one with first and last names.

library(tidyverse)
(x = tibble(fullname = c("Michael Smith", 
                         "Elisabeth Brown", 
                         "John-Henry Albert")))
#> # A tibble: 3 x 1
#>   fullname         
#>   <chr>            
#> 1 Michael Smith    
#> 2 Elisabeth Brown  
#> 3 John-Henry Albert

(y = tribble(~first, ~last,
            "Elisabeth", "Smith",
            "John", "Albert",
            "Roland", "Brown"))
#> # A tibble: 3 x 2
#>   first     last  
#>   <chr>     <chr> 
#> 1 Elisabeth Smith 
#> 2 John      Albert
#> 3 Roland    Brown

I'd like to make a single boolean column that is true only if the first and last column is within the fullname column.

In essence, I'm looking for something like:

x %>% 
  mutate(fname_match = str_detect(fullname, paste0(y$first, collapse = "|")), ## correct 
         lname_match = str_detect(fullname, paste0(y$last, collapse = "|"))) ## correct
#> # A tibble: 3 x 3
#>   fullname          fname_match lname_match
#>   <chr>             <lgl>       <lgl>      
#> 1 Michael Smith     FALSE       TRUE       
#> 2 Elisabeth Brown   TRUE        TRUE       
#> 3 John-Henry Albert TRUE        TRUE

But here if I took the columns with two TRUE's Elisabeth Brown would be a false positive because the matching first name and last name are not in the same row.

My best idea so far is to combine the first and last column and search for this, but this creates a false negative for John-Henry

y = tribble(~first, ~last,
            "Elisabeth", "Smith",
            "John", "Albert",
            "Roland", "Brown") %>% 
    rowwise() %>% 
    mutate(longname = paste(first, last, sep = "&"))


x %>% 
  mutate(full_match = str_detect(fullname, paste0(y$longname, collapse = "|")))
#> # A tibble: 3 x 2
#>   fullname          full_match
#>   <chr>             <lgl>     
#> 1 Michael Smith     FALSE     
#> 2 Elisabeth Brown   FALSE     
#> 3 John-Henry Albert FALSE

Upvotes: 2

Views: 639

Answers (1)

alex_jwb90
alex_jwb90

Reputation: 1713

I think this does what you want, using purrr::map2 to iterate over the tuples of first and last.

library(dplyr)
library(purrr)

y %>%
  mutate(
    name_match = map2_lgl(
      first, last, 
      .f = ~any(grepl(paste0(.x, '.*', .y), x$fullname, ignore.case = T))
    )
  )

Do mind, paste0(.x, '.*', .y) combines them into a regex that only lets rows pass in which the last name appears fully after the first. That seemed reasonable to do (otherwise, first name "Elisabeth", last name "Abe" would still be TRUE, which I here assume you would not want). Also, the above is case insensitive.

// UPDATE:
I forgot; inversely, if you want to check the fullname values in x, then you can run this:

x %>%
  rowwise() %>%
  mutate(
    name_match = any(map2_lgl(
      y$first, y$last,
      .f = ~grepl(paste0('\\b', .x, '\\b.*\\b', .y, '\\b'), fullname, ignore.case = T)
    ))
  )

Depending on how important this check is for you and how many assumptions you want to make, it might make sense to tweak the above regex a little further:

  1. ensure that the first name and last name stand as isolated words in the fullname
    -> paste0('\\b', .x, '\\b.*\\b', .y, '\\b')
  2. test that the first name comes right at the beginning
    -> paste0('^', .x, '\\b.*\\b', .y, '\\b')
  3. test that the fullname ends after the last name
    -> paste0('\\b', .x, '\\b.*\\b', .y, '$')

Upvotes: 2

Related Questions