Reputation: 1807
I have two datasets, one with full names and one with first and last names.
library(tidyverse)
(x = tibble(fullname = c("Michael Smith",
"Elisabeth Brown",
"John-Henry Albert")))
#> # A tibble: 3 x 1
#> fullname
#> <chr>
#> 1 Michael Smith
#> 2 Elisabeth Brown
#> 3 John-Henry Albert
(y = tribble(~first, ~last,
"Elisabeth", "Smith",
"John", "Albert",
"Roland", "Brown"))
#> # A tibble: 3 x 2
#> first last
#> <chr> <chr>
#> 1 Elisabeth Smith
#> 2 John Albert
#> 3 Roland Brown
I'd like to make a single boolean column that is true only if the first and last column is within the fullname column.
In essence, I'm looking for something like:
x %>%
mutate(fname_match = str_detect(fullname, paste0(y$first, collapse = "|")), ## correct
lname_match = str_detect(fullname, paste0(y$last, collapse = "|"))) ## correct
#> # A tibble: 3 x 3
#> fullname fname_match lname_match
#> <chr> <lgl> <lgl>
#> 1 Michael Smith FALSE TRUE
#> 2 Elisabeth Brown TRUE TRUE
#> 3 John-Henry Albert TRUE TRUE
But here if I took the columns with two TRUE
's Elisabeth Brown would be a false positive because the matching first name and last name are not in the same row.
My best idea so far is to combine the first and last column and search for this, but this creates a false negative for John-Henry
y = tribble(~first, ~last,
"Elisabeth", "Smith",
"John", "Albert",
"Roland", "Brown") %>%
rowwise() %>%
mutate(longname = paste(first, last, sep = "&"))
x %>%
mutate(full_match = str_detect(fullname, paste0(y$longname, collapse = "|")))
#> # A tibble: 3 x 2
#> fullname full_match
#> <chr> <lgl>
#> 1 Michael Smith FALSE
#> 2 Elisabeth Brown FALSE
#> 3 John-Henry Albert FALSE
Upvotes: 2
Views: 639
Reputation: 1713
I think this does what you want, using purrr::map2
to iterate over the tuples of first
and last
.
library(dplyr)
library(purrr)
y %>%
mutate(
name_match = map2_lgl(
first, last,
.f = ~any(grepl(paste0(.x, '.*', .y), x$fullname, ignore.case = T))
)
)
Do mind, paste0(.x, '.*', .y)
combines them into a regex that only lets rows pass in which the last name appears fully after the first. That seemed reasonable to do (otherwise, first name "Elisabeth", last name "Abe" would still be TRUE, which I here assume you would not want).
Also, the above is case insensitive.
// UPDATE:
I forgot; inversely, if you want to check the fullname
values in x
, then you can run this:
x %>%
rowwise() %>%
mutate(
name_match = any(map2_lgl(
y$first, y$last,
.f = ~grepl(paste0('\\b', .x, '\\b.*\\b', .y, '\\b'), fullname, ignore.case = T)
))
)
Depending on how important this check is for you and how many assumptions you want to make, it might make sense to tweak the above regex a little further:
paste0('\\b', .x, '\\b.*\\b', .y, '\\b')
paste0('^', .x, '\\b.*\\b', .y, '\\b')
paste0('\\b', .x, '\\b.*\\b', .y, '$')
Upvotes: 2