Reputation: 1134
the first dataset df1
structure(list(ID = 1:8, Address = c("Canal and Broadway", "55 water street room number 73",
"Mulberry street", "Front street and Fulton", "62nd street ",
"wythe street", "vanderbilt avenue", "South Beach avenue")), class = "data.frame", row.names = c(NA,
-8L))
the second dataset df2
structure(list(ID2 = 1:8, Address = c("Canal & Broadway", "Somewhere around 55 water street",
"Mulberry street", "Front street and close to Fulton", "south beach avenue",
"along wythe street on the southwest ", "vanderbilt ave", "62nd street"
)), class = "data.frame", row.names = c(NA, -8L))
df1
ID|Address
1 Canal and Broadway
2 55 water street room number 73
3 Mulberry street
4 Front street and Fulton
5 62nd street
6 wythe street
7 vanderbilt avenue
8 South Beach avenue
df2
ID2|Address
1 Canal & Broadway
2 Somewhere around 55 water street
3 Mulberry street
4 Front street and close to Fulton
5 south beach avenue
6 along wythe street on the southwest
7 vanderbilt ave
8 62nd street
Is there a way to match and get this outcome. Note that the addresses are similar but not exact.
ID| Address| ID2
1 Canal and Broadway 1
2 55 water street room number 73 2
3 Mulberry street 3
4 Front street and Fulton 4
5 62nd street 8
6 wythe street 6
7 vanderbilt avenue 7
8 South Beach avenue 5
Upvotes: 0
Views: 52
Reputation: 887841
We can use method = 'soundex
library(fuzzyjoin)
stringdist_left_join(df1, df2, by = 'Address', method = 'soundex')
Upvotes: 1
Reputation: 389235
As suggested by @r2evans look into fuzzyjoin
package. This will not give you your expected output out-of-the-box but will get you started.
fuzzyjoin::stringdist_left_join(df1, df2, by = 'Address', max_dist = 5)
# ID Address.x ID2 Address.y
#1 1 Canal and Broadway 1 Canal & Broadway
#2 2 55 water street room number 73 NA <NA>
#3 3 Mulberry street 3 Mulberry street
#4 4 Front street and Fulton NA <NA>
#5 5 62nd street 8 62nd street
#6 6 wythe street 8 62nd street
#7 7 vanderbilt avenue 7 vanderbilt ave
#8 8 South Beach avenue 5 south beach avenue
You many need to play around with maxdist
argument and apply additional rules to reach to the final output.
Upvotes: 3