wizkids121
wizkids121

Reputation: 656

Running a "match" between two different sets of data

I have two different datasets.

The first is called people and is structured as such:

people <- structure(list(userID = c(175890530, 178691082, 40228319, 472555502, 
                                  1063565418, 242983504, 3253221155), bio = c("Living in Atlanta", 
                                                                              "Born in Seattle, resident of Phoenix", "Columbus, Ohio", "Bronx born and raised", 
                                                                              "What's up Chicago?!?!", "Product of Los Angeles, taxpayer in St. Louis", 
                                                                              "Go Dallas Cowboys!")), class = "data.frame", row.names = c(NA, 
                                                                                                                                          -7L))

The next is file called location and is structured as such:

location <- structure(list(city = c("Atlanta", "Seattle", "Phoenix", "Columbus", 
                                  "Bronx", "Chicago", "Los Angeles", "St. Louis", "Dallas"), state = c("GA", 
                                                                                                       "WA", "AZ", "OH", "NY", "IL", "CA", "MO", "TX")), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                             -9L))

What I am trying to do is run a "match" against the bio field in the people dataset where it matches the string against the city field in the location dataset.

While in theory I could just do something like:

mutate(city = str_extract_all(bio, "Atlanta|Seattle|Phoenix|Columbus|Bronx|Chicago|Los Angeles|St. Louis|St. Louis|Dallas"))

That won't actually work in practice because I'll be working with way more data and far more possible cities, so it can't be something hardcoded. I am looking for an output that is structured as such:

complete <- structure(list(userID = c(175890530, 178691082, 40228319, 472555502, 
1063565418, 242983504, 3253221155), bio = c("Living in Atlanta", 
"Born in Seattle, resident of Phoenix", "Columbus, Ohio", "Bronx born and raised", 
"What's up Chicago?!?!", "Product of Los Angeles, taxpayer in St. Louis", 
"Go Dallas Cowboys!"), city_return = c("Atlanta", "Seattle, Phoenix", 
"Columbus", "Bronx", "Chicago", "Los Angeles, St. Louis", "Dallas"
)), class = "data.frame", row.names = c(NA, -7L))

The idea would be that it goes through each row in people$bio and "matches" it against all the possibilities within location$city and creates a new dataframe called complete that has the fields userID and bio from the people dataset along with a new column called city_return which gives us the matches we're looking for.

Upvotes: 0

Views: 52

Answers (1)

Jon Spring
Jon Spring

Reputation: 66570

library(tidyverse)
people %>%
  separate_rows(bio) %>%
  left_join(location, by = c("bio" = "city")) %>%
  filter(!is.na(state))

This mostly works, but two problems:

"Atlana" not matched to "Atlanta" but might be possible with fuzzyjoin but likely will yield false positives.

Not matching to Los Angeles since this only matches by single words. See below for an approach for two-word city names. You could run each of these and combine

# A tibble: 6 × 3
      userID bio      state
       <dbl> <chr>    <chr>
1  178691082 Seattle  WA    # two places mentioned for this user
2  178691082 Phoenix  AZ    # two places mentioned for this user
3   40228319 Columbus OH   
4  472555502 Bronx    NY   
5 1063565418 Chicago  IL    
6 3253221155 Dallas   TX 

If we want to catch two-word cities, we could do something like this:

left_join(
  people %>% tidytext::unnest_ngrams(bio, bio, n = 2),
  location %>% tidytext::unnest_ngrams(bio, city, n = 2) %>%
    filter(!is.na(bio))) %>%
  filter(!is.na(state))

Result

Joining, by = "bio"
     userID         bio state
1 242983504 los angeles    CA
2 242983504    st louis    MO

You could bind_rows( [first code], [second code] ) to get the full output.

Upvotes: 1

Related Questions