Reputation: 1944

Locate 2-word U.S. states within string in R

This is pretty straightforward, but I've been wrestling with it for too long now. I have several lists of word-salad strings containing US states that I'm coalescing into a tidy dataframe. Part of this is identifying the US states within the strings. I can find the single-word states (e.g. 'Florida'), but I cannot find a method to identify the two-word states (e.g. 'New York'). Can you help me identify them? The closest I've got is the code below.

I need the same string as output, the only difference being that two-word state names are joined by an underscore (e.g. "New_York").

library(tidyverse)

search_string <- "                                        Stamps\nNevada                         61,455           82,713           12,832            95,545 $      1,670,735  $          1,461,634  $       3,132,369\nNew Hampshire                  67,586          194,207           39,225          233,432  $      2,287,792  $          1,372,421  $       3,660,213\nNew Jersey                     82,814          282,527          146,678          429,205  $      6,335,263  $          2,813,593  $       9,148,856\nNew Mexico                   111,188           379,489           81,056          460,545  $      4,653,064  $          8,789,532  $      13,442,596\nNew York                     696,679           679,458           74,731          754,189  $     13,193,942  $          5,298,613  $      18,492,555\nNorth Carolina               433,135           471,648           24,260          495,908  $      8,446,725  $          1,203,040  $       9,649,765\nNorth Dakota                 141,816           413,234          162,252          575,486  $      3,526,114  $          4,310,924  $       7,837,038\nOhio                         426,856         1,068,917           17,723        1,086,640  $     15,007,107  $          1,396,546  $      16,403,653\nOklahoma                     330,336           334,251           14,673          348,924  $      5,849,527  $          1,277,809  $       7,127,336\nOregon                       297,944         1,344,799           64,439        1,409,238  $     14,306,510  $          4,298,684  $      18,605,194\nPennsylvania               1,048,731         2,398,471          122,202        2,520,673  $     30,601,457  $          8,181,893  $      38,783,350\nRhode Island                   10,750           29,270            3,553            32,823 $         216,706 $              79,868 $         296,574\nSouth Carolina               279,203           207,379           58,527          265,906  $      3,241,468  $          4,117,974  $       
7,359,442\nSouth Dakota                 216,152           247,222          100,706          347,928  $      5,588,964  $          9,431,150  $      15,020,114\nTennessee                    725,110         1,094,149           37,301        1,131,450  $     11,555,825  $          1,855,572  $      13,411,397\nTexas                      1,027,908         1,205,905           60,198        1,266,103  $     19,675,334  $          6,764,564  $      26,439,898\nUtah                         159,678           217,128           13,025          230,153  $      7,399,301  $          2,826,440  $      10,225,741\nVermont                        92,138          168,989           23,319          192,308  $      2,461,500  $          1,250,190  $       3,711,690\nVirginia                     314,748           774,910           48,213          823,123  $      8,800,321  $          2,369,762  $      11,170,083\nWashington                   198,162           780,794           10,718          791,512  $     10,837,451  $             744,633 $      11,582,084\nWest Virginia                288,098           656,091          174,657          830,748  $      4,831,265  $          6,227,285  $      11,058,550\nWisconsin                    689,099         2,472,489          127,017        2,599,506  $     24,942,778  $          6,534,212  $      31,476,990\nWyoming                      137,608           165,464           75,434          240,898  $      4,258,947  $         15,242,063  $      19,501,010\nTotal                     14,966,406       31,340,988         2,846,854       34,187,842 $     412,251,767  $        246,742,031  $     658,993,797\nU.S. Territories & DC"


 
search_string %>% 
  str_squish() %>% 
  str_subset('\\bWest Virginia\\b')

~ Edit

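One approach using dplyr and glue (library(tidyverse) does not attach glue, so it has to be loaded separately with library(glue)):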

search_string %>% 
  str_squish() %>% 
  str_split(' ') %>% 
  flatten_chr() %>% 
  as_tibble() %>% 
  mutate(lead = lead(value)) %>% 
  mutate(alfa = case_when(
    str_detect(value, 
               glue_collapse(
                 c('South',
                   'North', 
                   'New', 
                   'West', 
                   'Rhode'),
                 sep = '|')) ~ glue('{value}_{lead}'), 
    TRUE ~ value
  )) %>% 
  pull(alfa)

Upvotes: 1

Answers (3)

Till

Reputation: 6663

Clean up

With gsub() we replace all digits, dollar signs and commas with spaces. Then we split on the newline character \n and collapse the extra whitespace with str_squish().

a <- gsub("[0-9$,]", " ", search_string) %>% 
  strsplit("\n", fixed = TRUE) %>% 
  .[[1]] %>% 
  str_squish()

Now we have a vector of the row labels: the state names plus the non-state entries "Stamps", "Total" and "U.S. Territories & DC".

a
#>  [1] "Stamps"                "Nevada"                "New Hampshire"        
#>  [4] "New Jersey"            "New Mexico"            "New York"             
#>  [7] "North Carolina"        "North Dakota"          "Ohio"                 
#> [10] "Oklahoma"              "Oregon"                "Pennsylvania"         
#> [13] "Rhode Island"          "South Carolina"        "South Dakota"         
#> [16] "Tennessee"             "Texas"                 "Utah"                 
#> [19] "Vermont"               "Virginia"              "Washington"           
#> [22] "West Virginia"         "Wisconsin"             "Wyoming"              
#> [25] "Total"                 "U.S. Territories & DC"

And we can get the states with more than one word by selecting the entries that contain a space with grep().

b <- a[grep(" ", a)]
b
#>  [1] "New Hampshire"         "New Jersey"            "New Mexico"           
#>  [4] "New York"              "North Carolina"        "North Dakota"         
#>  [7] "Rhode Island"          "South Carolina"        "South Dakota"         
#> [10] "West Virginia"         "U.S. Territories & DC"
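
Selecting on a space also picks up "U.S. Territories & DC", which is not a state. If only genuine state names should be kept, one option is to compare against R's built-in state.name vector (from the datasets package, which holds the 50 state names). A minimal sketch, assuming a shortened stand-in vector `labels` in place of `a`; the names `labels`, `multi_word_states` and `two_word` are illustrative:

```r
# Stand-in for the cleaned vector `a` (hypothetical, shortened sample)
labels <- c("Stamps", "Nevada", "New Hampshire", "Ohio",
            "West Virginia", "Total", "U.S. Territories & DC")

# state.name ships with R (datasets package); keep the names containing a space
multi_word_states <- state.name[grepl(" ", state.name, fixed = TRUE)]

# Keep only the labels that are real multi-word states
two_word <- labels[labels %in% multi_word_states]
two_word
#> [1] "New Hampshire" "West Virginia"
```

The trade-off is that anything outside the 50 states (territories, DC) is dropped, so this only fits if those rows are not needed.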

Replacing white space in two-word states with underscore

We create a string vector containing the replacement strings and use mgsub::mgsub() to do the replacements.

c <- gsub(" ", "_", b)
c
#>  [1] "New_Hampshire"         "New_Jersey"            "New_Mexico"           
#>  [4] "New_York"              "North_Carolina"        "North_Dakota"         
#>  [7] "Rhode_Island"          "South_Carolina"        "South_Dakota"         
#> [10] "West_Virginia"         "U.S._Territories_&_DC"

library(mgsub)
mgsub(search_string, b, c)
#> [1] "                                        Stamps\nNevada                         61,455           82,713           12,832            95,545 $      1,670,735  $          1,461,634  $       3,132,369\nNew_Hampshire                  67,586          194,207           39,225          233,432  $      2,287,792  $          1,372,421  $       3,660,213\nNew_Jersey                     82,814          282,527          146,678          429,205  $      6,335,263  $          2,813,593  $       9,148,856\nNew_Mexico                   111,188           379,489           81,056          460,545  $      4,653,064  $          8,789,532  $      13,442,596\nNew_York                     696,679           679,458           74,731          754,189  $     13,193,942  $          5,298,613  $      18,492,555\nNorth_Carolina               433,135           471,648           24,260          495,908  $      8,446,725  $          1,203,040  $       9,649,765\nNorth_Dakota                 141,816           413,234          162,252          575,486  $      3,526,114  $          4,310,924  $       7,837,038\nOhio                         426,856         1,068,917           17,723        1,086,640  $     15,007,107  $          1,396,546  $      16,403,653\nOklahoma                     330,336           334,251           14,673          348,924  $      5,849,527  $          1,277,809  $       7,127,336\nOregon                       297,944         1,344,799           64,439        1,409,238  $     14,306,510  $          4,298,684  $      18,605,194\nPennsylvania               1,048,731         2,398,471          122,202        2,520,673  $     30,601,457  $          8,181,893  $      38,783,350\nRhode_Island                   10,750           29,270            3,553            32,823 $         216,706 $              79,868 $         296,574\nSouth_Carolina               279,203           207,379           58,527          265,906  $      3,241,468  $          4,117,974  $       
7,359,442\nSouth_Dakota                 216,152           247,222          100,706          347,928  $      5,588,964  $          9,431,150  $      15,020,114\nTennessee                    725,110         1,094,149           37,301        1,131,450  $     11,555,825  $          1,855,572  $      13,411,397\nTexas                      1,027,908         1,205,905           60,198        1,266,103  $     19,675,334  $          6,764,564  $      26,439,898\nUtah                         159,678           217,128           13,025          230,153  $      7,399,301  $          2,826,440  $      10,225,741\nVermont                        92,138          168,989           23,319          192,308  $      2,461,500  $          1,250,190  $       3,711,690\nVirginia                     314,748           774,910           48,213          823,123  $      8,800,321  $          2,369,762  $      11,170,083\nWashington                   198,162           780,794           10,718          791,512  $     10,837,451  $             744,633 $      11,582,084\nWest_Virginia                288,098           656,091          174,657          830,748  $      4,831,265  $          6,227,285  $      11,058,550\nWisconsin                    689,099         2,472,489          127,017        2,599,506  $     24,942,778  $          6,534,212  $      31,476,990\nWyoming                      137,608           165,464           75,434          240,898  $      4,258,947  $         15,242,063  $      19,501,010\nTotal                     14,966,406       31,340,988         2,846,854       34,187,842 $     412,251,767  $        246,742,031  $     658,993,797\nU.S._Territories_&_DC"

Created on 2020-11-06 by the reprex package (v0.3.0)
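
If installing mgsub is not an option, the same replacement could probably be done in base R by looping over the multi-word names and calling gsub() with fixed = TRUE. This is only a sketch: the sample string `txt` and the vectors `old` and `new` are illustrative assumptions, not the original data.

```r
# Hypothetical shortened sample in the same layout as search_string
txt <- "Stamps\nNevada   61,455\nNew York   696,679\nWest Virginia   288,098\nVirginia   314,748"

old <- c("New York", "West Virginia")     # multi-word names to find
new <- gsub(" ", "_", old, fixed = TRUE)  # their underscore versions

out <- txt
for (i in seq_along(old)) {
  # fixed = TRUE treats the name literally, so characters such as "." or "&"
  # are not interpreted as regular-expression syntax
  out <- gsub(old[i], new[i], out, fixed = TRUE)
}
out
#> [1] "Stamps\nNevada   61,455\nNew_York   696,679\nWest_Virginia   288,098\nVirginia   314,748"
```

Because the replacements are applied one after another, this is only safe when no replacement creates text that a later pattern would match. That holds for these state names.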

Upvotes: 2

Abdessabour Mtk

Reputation: 3888

This is not an exotic answer, but a simple call to str_replace_all() gives the desired output:

str_replace_all(search_string, c("(?<=\\n)(\\w+) (?=\\w+)"="\\1_"))

Basically, the state names are always preceded by a newline, (?<=\\n). We capture the first word, (\\w+), match a single space, and check that it is followed by another word, (?=\\w+). Only then is the match (the first word plus the space) replaced with the captured group plus an underscore, \\1_.
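
To see the pattern at work on something small, here is a sketch that applies the same regular expression with base R's gsub(perl = TRUE) instead of str_replace_all(), so it runs without extra packages. The sample string `txt` is a made-up, shortened stand-in for search_string.

```r
# Hypothetical shortened sample in the same layout as search_string
txt <- "Stamps\nNevada   61,455\nNew York   696,679\nRhode Island   10,750"

# (?<=\n)  position right after a newline
# (\w+)    first word of the label, captured as group 1
# " "      exactly one space
# (?=\w+)  only if a word character follows immediately
out <- gsub("(?<=\\n)(\\w+) (?=\\w+)", "\\1_", txt, perl = TRUE)
out
#> [1] "Stamps\nNevada   61,455\nNew_York   696,679\nRhode_Island   10,750"
```

"Nevada" is left alone because it is followed by several spaces, so the lookahead fails after the first one. Two caveats follow from the pattern itself: a label on the very first line has no preceding newline and would not be matched, and a single-word state separated from its first number by exactly one space would be joined to that number.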

Upvotes: 2

Chris Ruehlemann

Reputation: 21432

Is this what you're looking for?

trimws(unlist(strsplit(gsub("\\bStamps\\b|\\bTotal\\b|\\n|\\d|\\$|,|\\s{2,},", "", search_string, perl = TRUE), "\\s{2,}", perl = TRUE)))
 [1] "Nevada"                "New Hampshire"         "New Jersey"            "New Mexico"           
 [5] "New York"              "North Carolina"        "North Dakota"          "Ohio"                 
 [9] "Oklahoma"              "Oregon"                "Pennsylvania"          "Rhode Island"         
[13] "South Carolina"        "South Dakota"          "Tennessee"             "Texas"                
[17] "Utah"                  "Vermont"               "Virginia"              "Washington"           
[21] "West Virginia"         "Wisconsin"             "Wyoming"               "U.S. Territories & DC"