Reputation: 307
I need to merge two datasets based on columns that contain names that don't exaclty match, sometimes because one of the columns has a missing name with respect to the other. For example, in one column I have "Martín Gallardo"
and in the other I have "Martín Ricardo Gallardo"
. Another problem is that in some first and last name appear reversed, like "Martín Gallardo"
in one and "Gallardo Martín"
in the other. How can I match this using R? My first thought was to use str_split
in both and assign each on one set to the one that matches more elements from the other set, but I'm not sure how to code this.
Thank you.
Edit: data looks something like this
A <- tibble(email=c("[email protected]","[email protected]"),
name=c("martin", "raul"), last_name=c("gallardo","gimenez"),
full_name=c("martin gallardo", "raul gimenez"))
A
# A tibble: 2 x 4
# email name last_name full_name
# <chr> <chr> <chr> <chr>
# 1 [email protected] martin gallardo martin gallardo
# 2 [email protected] raul gimenez raul gimenez
B <- tibble(email=c("[email protected]", "[email protected]"),
name=c("martin ricardo", "gimenez"), last_name=c("gallardo", "raul"),
full_name=c("martin ricardo gallardo", "gimenez raul"), other_data=c("A", "B"))
B
# A tibble: 2 x 5
# email name last_name full_name other_data
# <chr> <chr> <chr> <chr> <chr>
# 1 [email protected] martin ricardo gallardo martin ricardo gallardo A
# 2 [email protected] gimenez raul gimenez raul B
Upvotes: 1
Views: 726
Reputation: 21908
In order for these two data sets to be matched I first created a column nombre_completo2
in a restructured form of data set A
based on how nombre_completo
in data set A
partially match the same column in data set B
. Then I merged the two data sets so that the additional columns in data set B
is added to the restructured form of A
. This is how I interpreted your desired output in the first place so I hope it will be useful to you:
A <- tibble(email=c("[email protected]","[email protected]"),
name=c("martin", "raul"), last_name=c("gallardo","gimenez"),
nombre_completo=c("martin gallardo", "raul gimenez"))
B <- tibble(email=c("[email protected]", "[email protected]"),
name=c("martin ricardo", "gimenez"), last_name=c("gallardo", "raul"),
nombre_completo=c("martin ricardo gallardo", "gimenez raul"),
other_data=c("A", "B"))
library(dplyr)
library(tidyr)
library(purrr)
A %>%
rowwise() %>%
mutate(nombre_completo2 = map_chr(nombre_completo,
~ B$nombre_completo
[str_detect(B$nombre_completo, str_sub(.x, 1L, 4L))])) %>%
inner_join(B, by = c("nombre_completo2" = "nombre_completo")) %>%
select(!ends_with(".y")) %>%
rename_with(~ str_replace(., ".x", ""), ends_with(".x"))
# A tibble: 2 x 6
# Rowwise:
email name last_name nombre_completo nombre_completo2 other_data
<chr> <chr> <chr> <chr> <chr> <chr>
1 [email protected] martin gallardo martin gallardo martin ricardo gallar~ A
2 [email protected] raul gimenez raul gimenez gimenez raul B
Upvotes: 1
Reputation: 3047
This is a tidyverse way to do the join. It basically finds full_name from B that has the highest number of common words with A. library(tidyverse)
A1 <- tibble(
nombre_completo = c("martin gallardo", "raul gimenez")
) %>%
mutate(
id_A = row_number()
)
B1 <- tibble(
nombre_completo=c("martin ricardo gallardo", "gimenez raul"),
other_data=c("A", "B")
) %>%
mutate(
id_B = row_number()
)
A2 <- A1 %>%
mutate(
name_words = str_split(nombre_completo, pattern = " ")
) %>%
unnest(cols = c(name_words))
B2 <- B1 %>%
mutate(
name_words = str_split(nombre_completo, pattern = " ")
) %>%
unnest(cols = c(name_words)) %>%
select(name_words, id_B )
left_join(A2, B2, by = "name_words") %>%
group_by(nombre_completo, id_A, id_B) %>%
count() %>% ungroup() %>%
group_by(nombre_completo, id_A) %>%
slice_max(order_by = n) %>%
select("nombre_completo_A" = nombre_completo, id_A, id_B) %>%
left_join(B1, by = "id_B")
Upvotes: 2