Reputation: 656
I have two different datasets.
The first is called people
and is structured as such:
people <- structure(list(userID = c(175890530, 178691082, 40228319, 472555502,
1063565418, 242983504, 3253221155), bio = c("Living in Atlanta",
"Born in Seattle, resident of Phoenix", "Columbus, Ohio", "Bronx born and raised",
"What's up Chicago?!?!", "Product of Los Angeles, taxpayer in St. Louis",
"Go Dallas Cowboys!")), class = "data.frame", row.names = c(NA,
-7L))
The next is file called location
and is structured as such:
location <- structure(list(city = c("Atlanta", "Seattle", "Phoenix", "Columbus",
"Bronx", "Chicago", "Los Angeles", "St. Louis", "Dallas"), state = c("GA",
"WA", "AZ", "OH", "NY", "IL", "CA", "MO", "TX")), class = "data.frame", row.names = c(NA,
-9L))
What I am trying to do is run a "match" against the bio
field in the people
dataset where it matches the string against the city
field in the location
dataset.
While in theory I could just do something like:
mutate(city = str_extract_all(bio, "Atlanta|Seattle|Phoenix|Columbus|Bronx|Chicago|Los Angeles|St. Louis|St. Louis|Dallas"))
That won't actually work in practice because I'll be working with way more data and far more possible cities, so it can't be something hardcoded. I am looking for an output that is structured as such:
complete <- structure(list(userID = c(175890530, 178691082, 40228319, 472555502,
1063565418, 242983504, 3253221155), bio = c("Living in Atlanta",
"Born in Seattle, resident of Phoenix", "Columbus, Ohio", "Bronx born and raised",
"What's up Chicago?!?!", "Product of Los Angeles, taxpayer in St. Louis",
"Go Dallas Cowboys!"), city_return = c("Atlanta", "Seattle, Phoenix",
"Columbus", "Bronx", "Chicago", "Los Angeles, St. Louis", "Dallas"
)), class = "data.frame", row.names = c(NA, -7L))
The idea would be that it goes through each row in people$bio
and "matches" it against all the possibilities within location$city
and creates a new dataframe called complete
that has the fields userID
and bio
from the people
dataset along with a new column called city_return
which gives us the matches we're looking for.
Upvotes: 0
Views: 52
Reputation: 66570
library(tidyverse)
people %>%
separate_rows(bio) %>%
left_join(location, by = c("bio" = "city")) %>%
filter(!is.na(state))
This mostly works, but two problems:
"Atlana" not matched to "Atlanta" but might be possible with fuzzyjoin
but likely will yield false positives.
Not matching to Los Angeles since this only matches by single words. See below for an approach for two-word city names. You could run each of these and combine
# A tibble: 6 × 3
userID bio state
<dbl> <chr> <chr>
1 178691082 Seattle WA # two places mentioned for this user
2 178691082 Phoenix AZ # two places mentioned for this user
3 40228319 Columbus OH
4 472555502 Bronx NY
5 1063565418 Chicago IL
6 3253221155 Dallas TX
If we want to catch two-word cities, we could do something like this:
left_join(
people %>% tidytext::unnest_ngrams(bio, bio, n = 2),
location %>% tidytext::unnest_ngrams(bio, city, n = 2) %>%
filter(!is.na(bio))) %>%
filter(!is.na(state))
Result
Joining, by = "bio"
userID bio state
1 242983504 los angeles CA
2 242983504 st louis MO
You could bind_rows( [first code], [second code] )
to get the full output.
Upvotes: 1