Reputation: 297
I have two df, each from a different source. They both reference the same information but one has an additional column that I am interested in.
head(df1)
DISTRICT_NAMES
<chr>
1 HUANUCO
2 SAN MARTIN DE PORRES
3 ANCON
4 RIMAC
5 JESUS MARIA
6 VILLA MARIA DEL TRIUNFO
#df1 have 1690 obs.
head(df2)
UBIGEO DISTRICT_NAMES POP
<dbl> <chr> <dbl>
1 10101 HUÁNUCO 32589
2 10102 SAN-MARTIN-DE-PORRES 262
3 10103 ANCÓN 1136
4 10104 RÍMAC 642
5 10105 JESÚS-MARÍA 585
6 10106 VILLA MARÍA DEL TF 1781
#df2 have 1874 obs.
As you can see, the 'DISTRICT_NAMES' column has similar values in both df, although not identical. As the data frames stand at the moment I would not be able to merge them through the merge
function, which is my ultimate intention.
My query is if there is a way to tell R: "if the value of one row matches the other in 3 out of 7 letters, assume they are the same and proceed with the merge".
Just in case, df1
is available here and df2
here
Upvotes: 0
Views: 230
Reputation: 78927
Here is a strategy that may help:
DISTRICT_NAMES_1
in df1
and df2
that replaces all non alphabetic characters with regex "[^[:alnum:]]"
by
to identify each row later.left_join
df1
merged with df2
where a match of new column DISTRICT_NAMES_1
occurs and if no match than you get an NA.Note I am not sure if this strategy will succeed but you can try!
library(dplyr)
df1_new <- df1 %>%
mutate(DISTRICT_NAMES_1 = str_replace_all(DISTRICT_NAMES, "[^[:alnum:]]", "")) %>%
arrange(DISTRICT_NAMES_1)
df2_new <- df2 %>%
mutate(DISTRICT_NAMES_1 = str_replace_all(DISTRICT_NAMES, "[^[:alnum:]]", "")) %>%
arrange(DISTRICT_NAMES_1)
df_result <- df1_new %>%
left_join(df2_new, by="DISTRICT_NAMES_1")
Output:
> df_result
# A tibble: 1,827 x 5
DISTRICT_NAMES.x DISTRICT_NAMES_1 UBIGEO DISTRICT_NAMES.y POP
<chr> <chr> <dbl> <chr> <dbl>
1 ABANCAY ABANCAY 30101 ABANCAY 69028
2 ABELARDO PARDO LEZAMETA ABELARDOPARDOLEZAMETA 20502 ABELARDO PARDO LEZAMETA 244
3 ACARI ACARI NA NA NA
4 ACAS ACAS 21402 ACAS 656
5 ACCHA ACCHA 81002 ACCHA 3028
6 ACCOMARCA ACCOMARCA 51102 ACCOMARCA 886
7 ACHAYA ACHAYA 210202 ACHAYA 2826
8 ACHOMA ACHOMA 40502 ACHOMA 841
9 ACO ACO 20902 ACO 379
10 ACO ACO 120202 ACO 1642
Upvotes: 2
Reputation: 388982
You can try with stringdist
joins adjusting the max_dist
argument as per your data and preference.
fuzzyjoin::stringdist_left_join(df1, df2, by = 'DISTRICT_NAMES', max_dist = 4)
# DISTRICT_NAMES.x UBIGEO DISTRICT_NAMES.y POP
#1 HUANUCO 10101 HUÁNUCO 32589
#2 SAN MARTIN DE PORRES 10102 SAN-MARTIN-DE-PORRES 262
#3 ANCON 10103 ANCÓN 1136
#4 RIMAC 10104 RÍMAC 642
#5 JESUS MARIA 10105 JESÚS-MARÍA 585
#6 VILLA MARIA DEL TRIUNFO NA <NA> NA
Upvotes: 0