Reputation: 431
I have a dataframe "city_table" that looks like this:
+---+---------------------+
| | city |
+---+---------------------+
| 1 | Chicago-2234dxsw |
+---+---------------------+
| 2 | Chicago,IL |
+---+---------------------+
| 3 | Chicago |
+---+---------------------+
| 4 | Chicago - 124421xsd |
+---+---------------------+
| 5 | Chicago_2133xx |
+---+---------------------+
| 6 | Atlanta- 1234xx |
+---+---------------------+
| 7 | Atlanta, GA |
+---+---------------------+
| 8 | Atlanta - 123456T |
+---+---------------------+
I have another city code lookup table "city_lookup" that looks like this:
+---+--------------+-----------+
| | city_name | city_code |
+---+--------------+-----------+
| 1 | Chicago, IL | 001 |
+---+--------------+-----------+
| 2 | Atlanta, GA | 002 |
+---+--------------+-----------+
As you can see, city names in "city" are messy and formatted differently, while as the city names in "city_code" are following unified format (city,STATE).
I would like the final table that, through matching first n characters (let's day, n=7) between city_table$city
vs. city_lookup$city_name
, return me the city code properly, sth like this:
+---+---------------------+-----------+
| | city_name | city_code |
+---+---------------------+-----------+
| 1 | Chicago-2234dxsw | 001 |
+---+---------------------+-----------+
| 2 | Chicago,IL | 001 |
+---+---------------------+-----------+
| 3 | Chicago | 001 |
+---+---------------------+-----------+
| 4 | Chicago - 124421xsd | 001 |
+---+---------------------+-----------+
| 5 | Chicago_2133xx | 001 |
+---+---------------------+-----------+
| 6 | Atlanta- 1234xx | 002 |
+---+---------------------+-----------+
| 7 | Atlanta, GA | 002 |
+---+---------------------+-----------+
| 8 | Atlanta - 123456T | 002 |
+---+---------------------+-----------+
I am doing this in R, preferably using tidyverse/dplyr. Thanks so much for your help!
Upvotes: 1
Views: 343
Reputation: 887731
We can create columns with substring
(as the OP asked in the question) and then do a regex_left_join
library(dplyr)
library(fuzzyjoin)
city_table %>%
mutate(city_sub = substring(city, 1, 7)) %>%
regex_left_join(city_lookup %>%
mutate(city_sub = substring(city_name, 1, 7)),
by = 'city_sub') %>%
select(city_name = city, city_code)
-output
# city_name city_code
#1 Chicago-2234dxsw 001
#2 Chicago,IL 001
#3 Chicago 001
#4 Chicago - 124421xsd 001
#5 Chicago_2133xx 001
#6 Atlanta- 1234xx 002
#7 Atlanta, GA 002
#8 Atlanta - 123456T 002
city_table <- structure(list(city = c("Chicago-2234dxsw", "Chicago,IL", "Chicago",
"Chicago - 124421xsd", "Chicago_2133xx", "Atlanta- 1234xx", "Atlanta, GA",
"Atlanta - 123456T")), class = "data.frame", row.names = c(NA,
-8L))
city_lookup <- structure(list(city_name = c("Chicago, IL", "Atlanta, GA"),
city_code = c("001",
"002")), class = "data.frame", row.names = c(NA, -2L))
Upvotes: 0
Reputation: 2945
Even better, as long as the characters after the full city names are always non-letters, you can match the entire city name as so:
city_table <- tibble(city = c("Chicago-2234dxsw", "Chicago,IL", "Atlanta - 123456T"))
city_lookup <- tibble(city_name = c("Chicago, IL", "Atlanta, GA"),
city_code = c("001", "002"))
city_table %>%
mutate(city_clean = gsub("^([a-zA-Z]*).*", "\\1", city)) %>%
left_join(city_lookup %>%
mutate(city_clean = gsub("^([a-zA-Z]*).*", "\\1", city_name, perl = T)),
by = "city_clean") %>%
select(-city_clean, -city_name)
city city_code
<chr> <chr>
1 Chicago-2234dxsw 001
2 Chicago,IL 001
3 Atlanta - 123456T 002
Upvotes: 1