José Carlos Rojas
José Carlos Rojas

Reputation: 297

Merging two data frames with non-identical rows in R

I have two df, each from a different source. They both reference the same information but one has an additional column that I am interested in.

head(df1)
  DISTRICT_NAMES               
  <chr>                  
1 HUANUCO                
2 SAN MARTIN DE PORRES   
3 ANCON                  
4 RIMAC                  
5 JESUS MARIA            
6 VILLA MARIA DEL TRIUNFO
#df1 have 1690 obs.

head(df2)
  UBIGEO DISTRICT_NAMES                 POP
   <dbl> <chr>                       <dbl>
1  10101 HUÁNUCO                     32589
2  10102 SAN-MARTIN-DE-PORRES          262
3  10103 ANCÓN                        1136
4  10104 RÍMAC                         642
5  10105 JESÚS-MARÍA                   585
6  10106 VILLA MARÍA DEL TF           1781
#df2 have 1874 obs.

As you can see, the 'DISTRICT_NAMES' column has similar values in both df, although not identical. As the data frames stand at the moment I would not be able to merge them through the merge function, which is my ultimate intention.

My query is if there is a way to tell R: "if the value of one row matches the other in 3 out of 7 letters, assume they are the same and proceed with the merge".

Just in case, df1 is available here and df2 here

Upvotes: 0

Views: 230

Answers (2)

TarJae
TarJae

Reputation: 78927

Here is a strategy that may help:

  1. create a new column DISTRICT_NAMES_1 in df1 and df2 that replaces all non alphabetic characters with regex "[^[:alnum:]]" by to identify each row later.
  2. then you can perform a left_join
  3. Now you got all the data in df1 merged with df2 where a match of new column DISTRICT_NAMES_1 occurs and if no match than you get an NA.
  4. Now you could do further data wrangling.

Note I am not sure if this strategy will succeed but you can try!

library(dplyr)

df1_new <- df1 %>% 
    mutate(DISTRICT_NAMES_1 = str_replace_all(DISTRICT_NAMES, "[^[:alnum:]]", "")) %>% 
    arrange(DISTRICT_NAMES_1)

df2_new <- df2 %>% 
    mutate(DISTRICT_NAMES_1 = str_replace_all(DISTRICT_NAMES, "[^[:alnum:]]", "")) %>% 
    arrange(DISTRICT_NAMES_1)

df_result <- df1_new %>% 
    left_join(df2_new, by="DISTRICT_NAMES_1") 

Output:

> df_result 
# A tibble: 1,827 x 5
   DISTRICT_NAMES.x        DISTRICT_NAMES_1      UBIGEO DISTRICT_NAMES.y          POP
   <chr>                   <chr>                  <dbl> <chr>                   <dbl>
 1 ABANCAY                 ABANCAY                30101 ABANCAY                 69028
 2 ABELARDO PARDO LEZAMETA ABELARDOPARDOLEZAMETA  20502 ABELARDO PARDO LEZAMETA   244
 3 ACARI                   ACARI                     NA NA                         NA
 4 ACAS                    ACAS                   21402 ACAS                      656
 5 ACCHA                   ACCHA                  81002 ACCHA                    3028
 6 ACCOMARCA               ACCOMARCA              51102 ACCOMARCA                 886
 7 ACHAYA                  ACHAYA                210202 ACHAYA                   2826
 8 ACHOMA                  ACHOMA                 40502 ACHOMA                    841
 9 ACO                     ACO                    20902 ACO                       379
10 ACO                     ACO                   120202 ACO                      1642

Upvotes: 2

Ronak Shah
Ronak Shah

Reputation: 388982

You can try with stringdist joins adjusting the max_dist argument as per your data and preference.

fuzzyjoin::stringdist_left_join(df1, df2, by = 'DISTRICT_NAMES', max_dist = 4)

#         DISTRICT_NAMES.x UBIGEO     DISTRICT_NAMES.y   POP
#1                 HUANUCO  10101              HUÁNUCO 32589
#2    SAN MARTIN DE PORRES  10102 SAN-MARTIN-DE-PORRES   262
#3                   ANCON  10103                ANCÓN  1136
#4                   RIMAC  10104                RÍMAC   642
#5             JESUS MARIA  10105          JESÚS-MARÍA   585
#6 VILLA MARIA DEL TRIUNFO     NA                 <NA>    NA

Upvotes: 0

Related Questions