Extracting counties and MSAs from state-county-MSA character variable

Question

I have state, county and MSA names in a single string variable states_county_MSA, and I want to split them to create three distinct variables - states, county and MSAs.

tail(df$states_county_MSA,n=10)
 [1] "Iowa Polk Des Moines"                          
 [2] "Mississippi Hinds Jackson"                     
 [3] "Georgia Richmond Augusta-Richmond"             
 [4] "Ohio Mahoning Youngstown-Warren-Boardman"      
 [5] "Pennsylvania Lackawanna Scranton--Wilkes-Barre"
 [6] "Pennsylvania Dauphin Harrisburg-Carlisle"      
 [7] "Florida Brevard Palm Bay-Melbourne-Titusville" 
 [8] "Utah Utah Provo-Orem"                          
 [9] "Tennessee Hamilton Chattanooga"                
[10] "North Carolina Durham Durham"

Modifying the solution by @jared_mamrot to a similar question (splitting state-county variable into state and county distinct variables posted below ; full problem here for reference - Extracting states and counties from state-county character variable ), I can split the states_county_MSA variable into two variables - states and county-MSA variable.

library(tidyverse)

states_county_names_df <- data.frame(states_county = c(
  "California San Francisco",                
  "New York Bronx",                                                    
  "Illinois Cook",                            
  "Massachusetts Suffolk",                    
  "District of Columbia District of Columbia"
  )
)

data(state)

states_inc_Columbia <- c(state.name, "District of Columbia")

states_county_names_df %>%
  mutate(state = str_extract(states_county, paste(states_inc_Columbia, collapse = "|")),
         county = str_remove(states_county, paste(states_inc_Columbia, collapse = "|")))

However, in this scenario, I am not able to decompose states_county_MSA further as I cannot find a function for county or MSA names. Not able to get county.names function to work, and tried using tigiris, censusapi and maps package but was unable to generate a vector of county names in US for the string split/extract command).

> data(county.names)
Warning in data(county.names) : data set ‘county.names’ not found

I was thinking of using the word function but names of MSAs are not standard either (one or more words).

Would anyone know a way to split the county-MSA in an efficient manner ?

EDIT - Data with (space) delimiter {county, state, MSA, MSA population, month, year}.

 [1] "Virginia         Richmond           Richmond                                1,210,063    8         2014"
 [2] "Louisiana        Orleans            New Orleans-Metairie-Kenner             1,195,794"                  
 [3] "North Carolina   Wake               Raleigh-Cary                            1,137,346     6        2014"
 [4] "New York         Erie               Buffalo-Niagara Falls                   1,135,342"                  
 [5] "Alabama          Jefferson          Birmingham-Hoover                       1,129,034"                  
 [6] "Utah             Salt Lake          Salt Lake City                          1,091,432     5        2014"
 [7] "New York         Monroe             Rochester                               1,080,082"                  
 [8] "Michigan         Kent               Grand Rapids-Wyoming                      989,205     7        2014"
 [9] "Arizona          Pima               Tucson                                    981,935    10        2013"
[10] "Hawaii           Honolulu           Honolulu                                  956,336     8        2013"

Extracting counties and MSAs from state-county-MSA character variable

Answers (1)

Related Questions