Reputation: 29
I have state, county and MSA names in a single string variable states_county_MSA, and I want to split them to create three distinct variables - states, county and MSAs.
tail(df$states_county_MSA,n=10)
[1] "Iowa Polk Des Moines"
[2] "Mississippi Hinds Jackson"
[3] "Georgia Richmond Augusta-Richmond"
[4] "Ohio Mahoning Youngstown-Warren-Boardman"
[5] "Pennsylvania Lackawanna Scranton--Wilkes-Barre"
[6] "Pennsylvania Dauphin Harrisburg-Carlisle"
[7] "Florida Brevard Palm Bay-Melbourne-Titusville"
[8] "Utah Utah Provo-Orem"
[9] "Tennessee Hamilton Chattanooga"
[10] "North Carolina Durham Durham"
Modifying the solution by @jared_mamrot to a similar question (splitting state-county variable into state and county distinct variables posted below ; full problem here for reference - Extracting states and counties from state-county character variable ), I can split the states_county_MSA variable into two variables - states and county-MSA variable.
library(tidyverse)
states_county_names_df <- data.frame(states_county = c(
"California San Francisco",
"New York Bronx",
"Illinois Cook",
"Massachusetts Suffolk",
"District of Columbia District of Columbia"
)
)
data(state)
states_inc_Columbia <- c(state.name, "District of Columbia")
states_county_names_df %>%
mutate(state = str_extract(states_county, paste(states_inc_Columbia, collapse = "|")),
county = str_remove(states_county, paste(states_inc_Columbia, collapse = "|")))
However, in this scenario, I am not able to decompose states_county_MSA further as I cannot find a function for county or MSA names. Not able to get county.names function to work, and tried using tigiris, censusapi and maps package but was unable to generate a vector of county names in US for the string split/extract command).
> data(county.names)
Warning in data(county.names) : data set ‘county.names’ not found
I was thinking of using the word function but names of MSAs are not standard either (one or more words).
Would anyone know a way to split the county-MSA in an efficient manner ?
EDIT - Data with (space) delimiter {county, state, MSA, MSA population, month, year}.
[1] "Virginia Richmond Richmond 1,210,063 8 2014"
[2] "Louisiana Orleans New Orleans-Metairie-Kenner 1,195,794"
[3] "North Carolina Wake Raleigh-Cary 1,137,346 6 2014"
[4] "New York Erie Buffalo-Niagara Falls 1,135,342"
[5] "Alabama Jefferson Birmingham-Hoover 1,129,034"
[6] "Utah Salt Lake Salt Lake City 1,091,432 5 2014"
[7] "New York Monroe Rochester 1,080,082"
[8] "Michigan Kent Grand Rapids-Wyoming 989,205 7 2014"
[9] "Arizona Pima Tucson 981,935 10 2013"
[10] "Hawaii Honolulu Honolulu 956,336 8 2013"
Upvotes: 0
Views: 64
Reputation: 301
I think this should work:
data <- tibble::tribble(~state_county_msa,
"Iowa Polk Des Moines" ,
"Mississippi Hinds Jackson" ,
"Georgia Richmond Augusta-Richmond" ,
"Ohio Mahoning Youngstown-Warren-Boardman" ,
"Pennsylvania Lackawanna Scranton--Wilkes-Barre",
"Pennsylvania Dauphin Harrisburg-Carlisle" ,
"Florida Brevard Palm Bay-Melbourne-Titusville" ,
"Utah Utah Provo-Orem" ,
"Tennessee Hamilton Chattanooga" ,
"North Carolina Durham Durham")
state_county <- ggplot2::map_data("county") %>%
select(state = region,
county = subregion) %>%
as_tibble() %>%
mutate(across(everything(),str_to_title)) %>%
unite(state_county, c("state","county"), sep = " ", remove = FALSE) %>%
distinct(state_county, .keep_all = TRUE)
state_county_string <- paste(state_county$state_county, collapse = "|")
data %>%
mutate(state_county = str_extract(state_county_msa, state_county_string),
msa = str_trim(str_remove(state_county_msa, state_county_string))) %>%
left_join(state_county, by = "state_county") %>%
select(state, county, msa)
Output:
# A tibble: 10 × 3
state county msa
<chr> <chr> <chr>
1 Iowa Polk Des Moines
2 Mississippi Hinds Jackson
3 Georgia Richmond Augusta-Richmond
4 Ohio Mahoning Youngstown-Warren-Boardman
5 Pennsylvania Lackawanna Scranton--Wilkes-Barre
6 Pennsylvania Dauphin Harrisburg-Carlisle
7 Florida Brevard Palm Bay-Melbourne-Titusville
8 Utah Utah Provo-Orem
9 Tennessee Hamilton Chattanooga
10 North Carolina Durham Durham
Upvotes: 1