Reputation: 431
I have a table ("city_table") that looks like this:
+----+--------------------------+
| | city |
+----+--------------------------+
| 1 | Austin |
+----+--------------------------+
| 2 | Austin TX |
+----+--------------------------+
| 3 | Austin Texas |
+----+--------------------------+
| 4 | Austin, Texas |
+----+--------------------------+
| 5 | Austin (city) |
+----+--------------------------+
| 6 | Butte - Bozeman |
+----+--------------------------+
| 7 | Butte-Bozeman |
+----+--------------------------+
| 8 | Butte-Bozeman MT |
+----+--------------------------+
| 9 | Butte-Bozeman, Montana |
+----+--------------------------+
| 10 | Butte - Bozeman, Montana |
+----+--------------------------+
I have a city code lookup table ("city_code") that looks like this:
+-----------+------------------+
| city_code | city_name |
+-----------+------------------+
| 001 | Austin,TX |
+-----------+------------------+
| 002 | Butte-Bozeman,MT |
+-----------+------------------+
As you can see, city names in "city_table" are messy and formatted differently, while as the city names in "city_code" are following unified format (city,STATE).
I would like the final table that, through approximate regex matching, return me the city code properly, sth like this:
+----+--------------------------+-----------+
| | city | city_code |
+----+--------------------------+-----------+
| 1 | Austin | 001 |
+----+--------------------------+-----------+
| 2 | Austin TX | 001 |
+----+--------------------------+-----------+
| 3 | Austin Texas | 001 |
+----+--------------------------+-----------+
| 4 | Austin, Texas | 001 |
+----+--------------------------+-----------+
| 5 | Austin (city) | 001 |
+----+--------------------------+-----------+
| 6 | Butte - Bozeman | 002 |
+----+--------------------------+-----------+
| 7 | Butte-Bozeman | 002 |
+----+--------------------------+-----------+
| 8 | Butte-Bozeman MT | 002 |
+----+--------------------------+-----------+
| 9 | Butte-Bozeman, Montana | 002 |
+----+--------------------------+-----------+
| 10 | Butte - Bozeman, Montana | 002 |
+----+--------------------------+-----------+
The real data has about 100 cities so I would need a systematic way to extract the key city name and do approximate/loose regex matching, match & found the city code and put in a new column. I am doing this in R, preferably using tidyverse/dplyr.
Thanks so much for your help!
Upvotes: 1
Views: 60
Reputation: 887223
We can use fuzzyjoin
library(fuzzyjoin)
stringdist_left_join(df1, df2, by = c("city" = "city_name"),
method = 'soundex')
-output
# city city_code city_name
#1 Austin 001 Austin,TX
#2 Austin TX 001 Austin,TX
#3 Austin Texas 001 Austin,TX
#4 Austin, Texas 001 Austin,TX
#5 Austin (city) 001 Austin,TX
#6 Butte - Bozeman 002 Butte-Bozeman,MT
#7 Butte-Bozeman 002 Butte-Bozeman,MT
#8 Butte-Bozeman MT 002 Butte-Bozeman,MT
#9 Butte-Bozeman, Montana 002 Butte-Bozeman,MT
#10 Butte - Bozeman, Montana 002 Butte-Bozeman,MT
df1 <- structure(list(city = c("Austin", "Austin TX", "Austin Texas",
"Austin, Texas", "Austin (city)", "Butte - Bozeman", "Butte-Bozeman",
"Butte-Bozeman MT", "Butte-Bozeman, Montana", "Butte - Bozeman, Montana"
)), class = "data.frame", row.names = c(NA, -10L))
df2 <- structure(list(city_code = c("001", "002"), city_name = c("Austin,TX",
"Butte-Bozeman,MT")), class = "data.frame", row.names = c(NA,
-2L))
Upvotes: 2