user35131
user35131

Reputation: 1134

I would like to match two datasets based on arbitrary address fields relative to each other using R

the first dataset df1

structure(list(ID = 1:8, Address = c("Canal and Broadway", "55 water street room number 73", 
"Mulberry street", "Front street and Fulton", "62nd street ", 
"wythe street", "vanderbilt avenue", "South Beach avenue")), class = "data.frame", row.names = c(NA, 
-8L))

the second dataset df2

structure(list(ID2 = 1:8, Address = c("Canal & Broadway", "Somewhere around 55 water street", 
"Mulberry street", "Front street and close to Fulton", "south beach avenue", 
"along wythe street on the southwest ", "vanderbilt ave", "62nd street"
)), class = "data.frame", row.names = c(NA, -8L))

df1

ID|Address
1   Canal and Broadway
2   55 water street room number 73
3   Mulberry street
4   Front street and Fulton
5   62nd street 
6   wythe street
7   vanderbilt avenue
8   South Beach avenue

df2

ID2|Address
 1  Canal & Broadway
 2  Somewhere around 55 water street
 3  Mulberry street
 4  Front street and close to Fulton
 5  south beach avenue
 6  along wythe street on the southwest 
 7  vanderbilt ave
 8  62nd street

Is there a way to match and get this outcome. Note that the addresses are similar but not exact.

    ID|                    Address| ID2
1   Canal and Broadway               1
2   55 water street room number 73   2
3   Mulberry street                  3
4   Front street and Fulton          4
5   62nd street                      8
6   wythe street                     6
7   vanderbilt avenue                7
8   South Beach avenue               5

Upvotes: 0

Views: 52

Answers (2)

akrun
akrun

Reputation: 887841

We can use method = 'soundex

library(fuzzyjoin)
stringdist_left_join(df1, df2, by = 'Address', method = 'soundex')

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389235

As suggested by @r2evans look into fuzzyjoin package. This will not give you your expected output out-of-the-box but will get you started.

fuzzyjoin::stringdist_left_join(df1, df2, by = 'Address', max_dist = 5)

#  ID                      Address.x ID2          Address.y
#1  1             Canal and Broadway   1   Canal & Broadway
#2  2 55 water street room number 73  NA               <NA>
#3  3                Mulberry street   3    Mulberry street
#4  4        Front street and Fulton  NA               <NA>
#5  5                   62nd street    8        62nd street
#6  6                   wythe street   8        62nd street
#7  7              vanderbilt avenue   7     vanderbilt ave
#8  8             South Beach avenue   5 south beach avenue

You many need to play around with maxdist argument and apply additional rules to reach to the final output.

Upvotes: 3

Related Questions