DPatrick
DPatrick

Reputation: 431

R studio: how to add city code by doing approximate regex matching

I have a table ("city_table") that looks like this:

+----+--------------------------+
|    | city                     |
+----+--------------------------+
| 1  | Austin                   |
+----+--------------------------+
| 2  | Austin TX                |
+----+--------------------------+
| 3  | Austin Texas             |
+----+--------------------------+
| 4  | Austin, Texas            |
+----+--------------------------+
| 5  | Austin (city)            |
+----+--------------------------+
| 6  | Butte - Bozeman          |
+----+--------------------------+
| 7  | Butte-Bozeman            |
+----+--------------------------+
| 8  | Butte-Bozeman MT         |
+----+--------------------------+
| 9  | Butte-Bozeman, Montana   |
+----+--------------------------+
| 10 | Butte - Bozeman, Montana |
+----+--------------------------+

I have a city code lookup table ("city_code") that looks like this:

+-----------+------------------+
| city_code | city_name        |
+-----------+------------------+
| 001       | Austin,TX        |
+-----------+------------------+
| 002       | Butte-Bozeman,MT |
+-----------+------------------+

As you can see, city names in "city_table" are messy and formatted differently, while as the city names in "city_code" are following unified format (city,STATE).

I would like the final table that, through approximate regex matching, return me the city code properly, sth like this:

+----+--------------------------+-----------+
|    | city                     | city_code |
+----+--------------------------+-----------+
| 1  | Austin                   | 001       |
+----+--------------------------+-----------+
| 2  | Austin TX                | 001       |
+----+--------------------------+-----------+
| 3  | Austin Texas             | 001       |
+----+--------------------------+-----------+
| 4  | Austin, Texas            | 001       |
+----+--------------------------+-----------+
| 5  | Austin (city)            | 001       |
+----+--------------------------+-----------+
| 6  | Butte - Bozeman          | 002       |
+----+--------------------------+-----------+
| 7  | Butte-Bozeman            | 002       |
+----+--------------------------+-----------+
| 8  | Butte-Bozeman MT         | 002       |
+----+--------------------------+-----------+
| 9  | Butte-Bozeman, Montana   | 002       |
+----+--------------------------+-----------+
| 10 | Butte - Bozeman, Montana | 002       |
+----+--------------------------+-----------+

The real data has about 100 cities so I would need a systematic way to extract the key city name and do approximate/loose regex matching, match & found the city code and put in a new column. I am doing this in R, preferably using tidyverse/dplyr.

Thanks so much for your help!

Upvotes: 1

Views: 60

Answers (1)

akrun
akrun

Reputation: 887223

We can use fuzzyjoin

library(fuzzyjoin)
stringdist_left_join(df1, df2, by = c("city" = "city_name"),
      method = 'soundex')

-output

#                        city city_code        city_name
#1                    Austin       001        Austin,TX
#2                 Austin TX       001        Austin,TX
#3              Austin Texas       001        Austin,TX
#4             Austin, Texas       001        Austin,TX
#5             Austin (city)       001        Austin,TX
#6           Butte - Bozeman       002 Butte-Bozeman,MT
#7             Butte-Bozeman       002 Butte-Bozeman,MT
#8          Butte-Bozeman MT       002 Butte-Bozeman,MT
#9    Butte-Bozeman, Montana       002 Butte-Bozeman,MT
#10 Butte - Bozeman, Montana       002 Butte-Bozeman,MT

data

df1 <- structure(list(city = c("Austin", "Austin TX", "Austin Texas", 
"Austin, Texas", "Austin (city)", "Butte - Bozeman", "Butte-Bozeman", 
"Butte-Bozeman MT", "Butte-Bozeman, Montana", "Butte - Bozeman, Montana"
)), class = "data.frame", row.names = c(NA, -10L))

df2 <- structure(list(city_code = c("001", "002"), city_name = c("Austin,TX", 
"Butte-Bozeman,MT")), class = "data.frame", row.names = c(NA, 
-2L))

Upvotes: 2

Related Questions