A_S_
A_S_

Reputation: 29

Extracting counties and MSAs from state-county-MSA character variable

I have state, county and MSA names in a single string variable states_county_MSA, and I want to split them to create three distinct variables - states, county and MSAs.

tail(df$states_county_MSA,n=10)
 [1] "Iowa Polk Des Moines"                          
 [2] "Mississippi Hinds Jackson"                     
 [3] "Georgia Richmond Augusta-Richmond"             
 [4] "Ohio Mahoning Youngstown-Warren-Boardman"      
 [5] "Pennsylvania Lackawanna Scranton--Wilkes-Barre"
 [6] "Pennsylvania Dauphin Harrisburg-Carlisle"      
 [7] "Florida Brevard Palm Bay-Melbourne-Titusville" 
 [8] "Utah Utah Provo-Orem"                          
 [9] "Tennessee Hamilton Chattanooga"                
[10] "North Carolina Durham Durham"

Modifying the solution by @jared_mamrot to a similar question (splitting state-county variable into state and county distinct variables posted below ; full problem here for reference - Extracting states and counties from state-county character variable ), I can split the states_county_MSA variable into two variables - states and county-MSA variable.

library(tidyverse)

states_county_names_df <- data.frame(states_county = c(
  "California San Francisco",                
  "New York Bronx",                                                    
  "Illinois Cook",                            
  "Massachusetts Suffolk",                    
  "District of Columbia District of Columbia"
  )
)

data(state)

states_inc_Columbia <- c(state.name, "District of Columbia")

states_county_names_df %>%
  mutate(state = str_extract(states_county, paste(states_inc_Columbia, collapse = "|")),
         county = str_remove(states_county, paste(states_inc_Columbia, collapse = "|")))

However, in this scenario, I am not able to decompose states_county_MSA further as I cannot find a function for county or MSA names. Not able to get county.names function to work, and tried using tigiris, censusapi and maps package but was unable to generate a vector of county names in US for the string split/extract command).

> data(county.names)
Warning in data(county.names) : data set ‘county.names’ not found

I was thinking of using the word function but names of MSAs are not standard either (one or more words).

Would anyone know a way to split the county-MSA in an efficient manner ?

EDIT - Data with (space) delimiter {county, state, MSA, MSA population, month, year}.

 [1] "Virginia         Richmond           Richmond                                1,210,063    8         2014"
 [2] "Louisiana        Orleans            New Orleans-Metairie-Kenner             1,195,794"                  
 [3] "North Carolina   Wake               Raleigh-Cary                            1,137,346     6        2014"
 [4] "New York         Erie               Buffalo-Niagara Falls                   1,135,342"                  
 [5] "Alabama          Jefferson          Birmingham-Hoover                       1,129,034"                  
 [6] "Utah             Salt Lake          Salt Lake City                          1,091,432     5        2014"
 [7] "New York         Monroe             Rochester                               1,080,082"                  
 [8] "Michigan         Kent               Grand Rapids-Wyoming                      989,205     7        2014"
 [9] "Arizona          Pima               Tucson                                    981,935    10        2013"
[10] "Hawaii           Honolulu           Honolulu                                  956,336     8        2013"

Upvotes: 0

Views: 64

Answers (1)

Matias1905
Matias1905

Reputation: 301

I think this should work:

data <- tibble::tribble(~state_county_msa,
                "Iowa Polk Des Moines"                          ,
                "Mississippi Hinds Jackson"                     ,
                "Georgia Richmond Augusta-Richmond"             ,
                "Ohio Mahoning Youngstown-Warren-Boardman"      ,
                "Pennsylvania Lackawanna Scranton--Wilkes-Barre",
                "Pennsylvania Dauphin Harrisburg-Carlisle"      ,
                "Florida Brevard Palm Bay-Melbourne-Titusville" ,
                "Utah Utah Provo-Orem"                          ,
                "Tennessee Hamilton Chattanooga"                ,
                "North Carolina Durham Durham")

state_county <- ggplot2::map_data("county") %>% 
  select(state = region,
         county = subregion) %>% 
  as_tibble() %>% 
  mutate(across(everything(),str_to_title)) %>% 
  unite(state_county, c("state","county"), sep = " ", remove = FALSE) %>% 
  distinct(state_county, .keep_all = TRUE)

state_county_string <- paste(state_county$state_county, collapse = "|")

data %>%
  mutate(state_county = str_extract(state_county_msa, state_county_string),
         msa = str_trim(str_remove(state_county_msa, state_county_string))) %>% 
  left_join(state_county, by = "state_county") %>% 
  select(state, county, msa)

Output:

# A tibble: 10 × 3
   state          county     msa                          
   <chr>          <chr>      <chr>                        
 1 Iowa           Polk       Des Moines                   
 2 Mississippi    Hinds      Jackson                      
 3 Georgia        Richmond   Augusta-Richmond             
 4 Ohio           Mahoning   Youngstown-Warren-Boardman   
 5 Pennsylvania   Lackawanna Scranton--Wilkes-Barre       
 6 Pennsylvania   Dauphin    Harrisburg-Carlisle          
 7 Florida        Brevard    Palm Bay-Melbourne-Titusville
 8 Utah           Utah       Provo-Orem                   
 9 Tennessee      Hamilton   Chattanooga                  
10 North Carolina Durham     Durham   

Upvotes: 1

Related Questions