Reputation: 1944
This is pretty straightforward, but I've been wrestling with it for too long now. I've got several lists of word-salad strings containing US states that I'm coalescing into a tidy dataframe. Part of this is identifying the US states within the strings. I can find the single-word states (e.g. 'Florida'), but I cannot find a method to identify the two-word states (e.g. 'New York'). Can you help me identify these? The closest code I've got is below.
I need to get the same string as output, the only difference being that two-word state names are joined by an underscore (e.g. "New_York").
library(tidyverse)
search_string <- " Stamps\nNevada 61,455 82,713 12,832 95,545 $ 1,670,735 $ 1,461,634 $ 3,132,369\nNew Hampshire 67,586 194,207 39,225 233,432 $ 2,287,792 $ 1,372,421 $ 3,660,213\nNew Jersey 82,814 282,527 146,678 429,205 $ 6,335,263 $ 2,813,593 $ 9,148,856\nNew Mexico 111,188 379,489 81,056 460,545 $ 4,653,064 $ 8,789,532 $ 13,442,596\nNew York 696,679 679,458 74,731 754,189 $ 13,193,942 $ 5,298,613 $ 18,492,555\nNorth Carolina 433,135 471,648 24,260 495,908 $ 8,446,725 $ 1,203,040 $ 9,649,765\nNorth Dakota 141,816 413,234 162,252 575,486 $ 3,526,114 $ 4,310,924 $ 7,837,038\nOhio 426,856 1,068,917 17,723 1,086,640 $ 15,007,107 $ 1,396,546 $ 16,403,653\nOklahoma 330,336 334,251 14,673 348,924 $ 5,849,527 $ 1,277,809 $ 7,127,336\nOregon 297,944 1,344,799 64,439 1,409,238 $ 14,306,510 $ 4,298,684 $ 18,605,194\nPennsylvania 1,048,731 2,398,471 122,202 2,520,673 $ 30,601,457 $ 8,181,893 $ 38,783,350\nRhode Island 10,750 29,270 3,553 32,823 $ 216,706 $ 79,868 $ 296,574\nSouth Carolina 279,203 207,379 58,527 265,906 $ 3,241,468 $ 4,117,974 $ 7,359,442\nSouth Dakota 216,152 247,222 100,706 347,928 $ 5,588,964 $ 9,431,150 $ 15,020,114\nTennessee 725,110 1,094,149 37,301 1,131,450 $ 11,555,825 $ 1,855,572 $ 13,411,397\nTexas 1,027,908 1,205,905 60,198 1,266,103 $ 19,675,334 $ 6,764,564 $ 26,439,898\nUtah 159,678 217,128 13,025 230,153 $ 7,399,301 $ 2,826,440 $ 10,225,741\nVermont 92,138 168,989 23,319 192,308 $ 2,461,500 $ 1,250,190 $ 3,711,690\nVirginia 314,748 774,910 48,213 823,123 $ 8,800,321 $ 2,369,762 $ 11,170,083\nWashington 198,162 780,794 10,718 791,512 $ 10,837,451 $ 744,633 $ 11,582,084\nWest Virginia 288,098 656,091 174,657 830,748 $ 4,831,265 $ 6,227,285 $ 11,058,550\nWisconsin 689,099 2,472,489 127,017 2,599,506 $ 24,942,778 $ 6,534,212 $ 31,476,990\nWyoming 137,608 165,464 75,434 240,898 $ 4,258,947 $ 15,242,063 $ 19,501,010\nTotal 14,966,406 31,340,988 2,846,854 34,187,842 $ 412,251,767 $ 246,742,031 $ 658,993,797\nU.S. Territories & DC"
search_string %>%
str_squish() %>%
str_subset('\\bWest Virginia\\b')
~ Edit
search_string %>%
str_squish() %>%
str_split(' ') %>%
flatten_chr() %>%
as_tibble() %>%
mutate(lead = lead(value)) %>%
mutate(alfa = case_when(
str_detect(value,
glue_collapse(
c('South',
'North',
'New',
'West',
'Rhode'),
sep = '|')) ~ glue('{value}_{lead}'),
T ~ value
)) %>%
pull(alfa)
Upvotes: 1
Views: 149
Reputation: 6663
With gsub() we replace all digits, $ signs and commas with spaces. Then we split on the newline character \n and get rid of the extra whitespace with str_squish().
a <- gsub("[0-9|\\$,]", " ", search_string) %>%
strsplit("\n", fixed = TRUE) %>%
.[[1]] %>%
str_squish()
Now we have a vector of all the row labels: the states, plus "Stamps", "Total" and "U.S. Territories & DC".
a
#> [1] "Stamps" "Nevada" "New Hampshire"
#> [4] "New Jersey" "New Mexico" "New York"
#> [7] "North Carolina" "North Dakota" "Ohio"
#> [10] "Oklahoma" "Oregon" "Pennsylvania"
#> [13] "Rhode Island" "South Carolina" "South Dakota"
#> [16] "Tennessee" "Texas" "Utah"
#> [19] "Vermont" "Virginia" "Washington"
#> [22] "West Virginia" "Wisconsin" "Wyoming"
#> [25] "Total" "U.S. Territories & DC"
And we can get the states with more than one word by selecting those that have a white space in them with grep().
b <- a[grep(" ", a)]
b
#> [1] "New Hampshire" "New Jersey" "New Mexico"
#> [4] "New York" "North Carolina" "North Dakota"
#> [7] "Rhode Island" "South Carolina" "South Dakota"
#> [10] "West Virginia" "U.S. Territories & DC"
We create a string vector containing the replacement strings and use mgsub::mgsub() to do the replacements.
c <- gsub(" ", "_", b)
c
#> [1] "New_Hampshire" "New_Jersey" "New_Mexico"
#> [4] "New_York" "North_Carolina" "North_Dakota"
#> [7] "Rhode_Island" "South_Carolina" "South_Dakota"
#> [10] "West_Virginia" "U.S._Territories_&_DC"
library(mgsub)
mgsub(search_string, b, c)
#> [1] " Stamps\nNevada 61,455 82,713 12,832 95,545 $ 1,670,735 $ 1,461,634 $ 3,132,369\nNew_Hampshire 67,586 194,207 39,225 233,432 $ 2,287,792 $ 1,372,421 $ 3,660,213\nNew_Jersey 82,814 282,527 146,678 429,205 $ 6,335,263 $ 2,813,593 $ 9,148,856\nNew_Mexico 111,188 379,489 81,056 460,545 $ 4,653,064 $ 8,789,532 $ 13,442,596\nNew_York 696,679 679,458 74,731 754,189 $ 13,193,942 $ 5,298,613 $ 18,492,555\nNorth_Carolina 433,135 471,648 24,260 495,908 $ 8,446,725 $ 1,203,040 $ 9,649,765\nNorth_Dakota 141,816 413,234 162,252 575,486 $ 3,526,114 $ 4,310,924 $ 7,837,038\nOhio 426,856 1,068,917 17,723 1,086,640 $ 15,007,107 $ 1,396,546 $ 16,403,653\nOklahoma 330,336 334,251 14,673 348,924 $ 5,849,527 $ 1,277,809 $ 7,127,336\nOregon 297,944 1,344,799 64,439 1,409,238 $ 14,306,510 $ 4,298,684 $ 18,605,194\nPennsylvania 1,048,731 2,398,471 122,202 2,520,673 $ 30,601,457 $ 8,181,893 $ 38,783,350\nRhode_Island 10,750 29,270 3,553 32,823 $ 216,706 $ 79,868 $ 296,574\nSouth_Carolina 279,203 207,379 58,527 265,906 $ 3,241,468 $ 4,117,974 $ 7,359,442\nSouth_Dakota 216,152 247,222 100,706 347,928 $ 5,588,964 $ 9,431,150 $ 15,020,114\nTennessee 725,110 1,094,149 37,301 1,131,450 $ 11,555,825 $ 1,855,572 $ 13,411,397\nTexas 1,027,908 1,205,905 60,198 1,266,103 $ 19,675,334 $ 6,764,564 $ 26,439,898\nUtah 159,678 217,128 13,025 230,153 $ 7,399,301 $ 2,826,440 $ 10,225,741\nVermont 92,138 168,989 23,319 192,308 $ 2,461,500 $ 1,250,190 $ 3,711,690\nVirginia 314,748 774,910 48,213 823,123 $ 8,800,321 $ 2,369,762 $ 11,170,083\nWashington 198,162 780,794 10,718 791,512 $ 10,837,451 $ 744,633 $ 11,582,084\nWest_Virginia 288,098 656,091 174,657 830,748 $ 4,831,265 $ 6,227,285 $ 11,058,550\nWisconsin 689,099 2,472,489 127,017 2,599,506 $ 24,942,778 $ 6,534,212 $ 31,476,990\nWyoming 137,608 165,464 75,434 240,898 $ 4,258,947 $ 15,242,063 $ 19,501,010\nTotal 14,966,406 31,340,988 2,846,854 34,187,842 $ 412,251,767 $ 246,742,031 $ 658,993,797\nU.S._Territories_&_DC"
Created on 2020-11-06 by the reprex package (v0.3.0)
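Note that selecting on white space also picks up "U.S. Territories & DC". If only real states should be changed, one option is to take the two-word names from R's built-in state.name vector instead. The following is a minimal sketch under that assumption; x is a hypothetical short stand-in for search_string, and a plain loop with gsub(fixed = TRUE) stands in for mgsub().

```r
# Two-word states taken from R's built-in state.name vector (datasets package)
two_word <- state.name[grepl(" ", state.name, fixed = TRUE)]

# Hypothetical short input standing in for search_string
x <- "Nevada 61,455\nNew York 696,679\nWest Virginia 288,098\nU.S. Territories & DC"

# Replace each two-word state, matched literally, with its underscored form
for (s in two_word) {
  x <- gsub(s, gsub(" ", "_", s, fixed = TRUE), x, fixed = TRUE)
}
x
```

Because only names found in state.name are replaced, the "U.S. Territories & DC" label keeps its spaces.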
Upvotes: 2
Reputation: 3888
This is not a very exotic answer, but with a simple use of str_replace_all we get the desired output:
str_replace_all(search_string, c("(?<=\\n)([A-Za-z]+) (?=[A-Za-z])"="\\1_"))
Basically, the state names are always preceded by a newline (?<=\\n). We capture the first word ([A-Za-z]+), then match a space, then check that it is followed by another letter (?=[A-Za-z]) rather than a digit, so that single-word states such as "Nevada 61,455" are left alone. Only then do we replace the match with the captured group plus an underscore \\1_.
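A quick way to check a pattern like this is to run it on a few lines. The sketch below uses base R's gsub() with perl = TRUE rather than stringr, and a hypothetical short string x in place of search_string. The lookahead is restricted to letters, on the assumption that a first word followed by a number should be left untouched.

```r
# Hypothetical short input standing in for search_string
x <- "Stamps\nNevada 61,455\nNew York 696,679\nRhode Island 10,750\nU.S. Territories & DC"

# After a newline: capture a word, match one space, and require a letter next
out <- gsub("(?<=\\n)([A-Za-z]+) (?=[A-Za-z])", "\\1_", x, perl = TRUE)

# Show the result line by line
cat(out)
```

Here "Nevada 61,455" is unchanged because the space is followed by a digit, and "U.S. Territories & DC" is unchanged because the first word after the newline is followed by a period rather than a space.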
Upvotes: 2
Reputation: 21432
Is this what you're looking for?
trimws(unlist(strsplit(gsub("\\bStamps\\b|\\bTotal\\b|\\n|\\d|\\$|,|\\s{2,},", "", search_string, perl = T), "\\s{2,}", perl = T)))
[1] "Nevada" "New Hampshire" "New Jersey" "New Mexico"
[5] "New York" "North Carolina" "North Dakota" "Ohio"
[9] "Oklahoma" "Oregon" "Pennsylvania" "Rhode Island"
[13] "South Carolina" "South Dakota" "Tennessee" "Texas"
[17] "Utah" "Vermont" "Virginia" "Washington"
[21] "West Virginia" "Wisconsin" "Wyoming" "U.S. Territories & DC"
Upvotes: 1