John Clegg

Reputation: 109

How to extract state names from a string

This seems obvious, but I can't figure it out. I have a character vector in which each element contains a state name alongside other random words, and I would like to extract the state name.

df <- data.frame(string = c("The quick brown Arizona","jumps over the Alabama","dog Arkansas"))

I can extract state names individually:

df$state[grepl("Alabama",df$string)] <- "Alabama"

but I can't figure out how to replicate that for all states without copying and pasting it 42 times. The closest I got was:

find.state <- function(x){
   df$state[grepl(x,df$string)] <- x
}
lapply(state.name, find.state)

but that just prints all the state names.

Upvotes: 1

Views: 1869

Answers (3)

Jared

Reputation: 3570

R comes with a variable holding the state names, state.name. Use paste to collapse it into one long character element, with | separating each state. This can be used as the search pattern for a regular expression.

library(stringr)
str_extract(df$string, paste(state.name, collapse='|'))
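
If you would rather not load stringr, the same idea can be sketched in base R with regexpr and regmatches. This is only a minimal sketch on the question's sample data; the variable names pattern and m are mine, not from the original post.

```r
# Sample data from the question (stringsAsFactors = FALSE keeps the column as character)
df <- data.frame(string = c("The quick brown Arizona",
                            "jumps over the Alabama",
                            "dog Arkansas"),
                 stringsAsFactors = FALSE)

# One alternation pattern covering every name in the built-in state.name vector
pattern <- paste(state.name, collapse = "|")

# regexpr() finds the first match in each string (-1 when there is none)
m <- regexpr(pattern, df$string)

# Default to NA, then fill in only the rows that matched
df$state <- NA_character_
df$state[m != -1] <- regmatches(df$string, m)
df$state
```

Rows without any state name stay NA, which mirrors what str_extract returns for a non-match.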

Upvotes: 3

MKR

Reputation: 20095

For the sample data provided by the OP, where the state name is always the last word, one option is:

gsub(".*\\s(\\w+)$","\\1",df$string)
#[1] "Arizona"  "Alabama"  "Arkansas"

Regex:

.*\\s    - Match anything up to and including the last whitespace character
(\\w+)$  - Capture the word characters from that last space to the end of the string. In the sample data this is the state name.
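
Note that this pattern only picks out the last word; it never checks against a list of states. A small sketch with made-up strings (the vector x below is hypothetical, not from the question) shows where that assumption stops holding:

```r
# Hypothetical inputs: a one-word state at the end, a two-word state,
# and a state that is not the last word
x <- c("dog Arkansas", "trip to New York", "Alabama is warm")

# Same regex as above: keep only the word after the last whitespace
last_word <- gsub(".*\\s(\\w+)$", "\\1", x)
last_word
```

Only the first element comes back as a full state name, so this approach fits data shaped like the OP's sample and not much else.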

Upvotes: 0

G5W

Reputation: 37661

You can do this with a somewhat awkward regular expression.

df$state = sub(".*\\b(Arizona|Alabama|Arkansas)\\b.*", "\\1", df$string)
df
                   string    state
1 The quick brown Arizona  Arizona
2  jumps over the Alabama  Alabama
3            dog Arkansas Arkansas

Of course, you need to include the names of all the states, not just these three. So you might build that as a pattern first.

Pattern = paste0(".*\\b(", paste(state.name, collapse="|"), ")\\b.*")
df$state = sub(Pattern, "\\1", df$string)
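
One thing to watch for: sub returns the input unchanged when the pattern does not match, so a row with no state name would get its whole string copied into df$state. A minimal sketch of one way to guard against that, assuming you want NA for such rows (the third sample row and the has_state variable are my additions for illustration):

```r
# Two rows from the question plus a hypothetical row with no state name
df <- data.frame(string = c("The quick brown Arizona",
                            "jumps over the Alabama",
                            "no state here"),
                 stringsAsFactors = FALSE)

# Same pattern as above, built from the built-in state.name vector
Pattern <- paste0(".*\\b(", paste(state.name, collapse = "|"), ")\\b.*")

# Only keep the substitution result where the pattern actually matched
has_state <- grepl(Pattern, df$string)
df$state <- ifelse(has_state, sub(Pattern, "\\1", df$string), NA_character_)
df
```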

Upvotes: 6
