Reputation: 1580
Say, for instance, I have the following character vector of alphanumeric elements that include state abbreviations somewhere within the element:
strings <- c("0001AZ226", "0001CA243", "0NA01CT134", "0001CT1NA", "0001ID112", "NAVA230")
How can I extract the letters, excluding NA? I.e.,
somefunction(strings)
[1] "AZ" "CA" "CT" "CT" "ID" "VA"
I've used regular expressions before to remove all non-digit characters from each element, but never to remove all digits plus just the letter pair NA.
This is what I tried, but it didn't work:
sub(paste(LETTERS[c(2:13,15:26)], collapse = "|"), "", strings, fixed = TRUE)
Upvotes: 1
Views: 1056
Reputation: 81683
A simple solution:
gsub("\\d+|NA", "", strings)
# [1] "AZ" "CA" "CT" "CT" "ID" "VA"
Upvotes: 2
Reputation: 263331
The state dataset is available by default. Look at:
?state
sts <- paste(state.abb,collapse="|")
sub(paste0( "(.+)(", sts, ")(.+)"), "\\2", strings)
[1] "AZ" "CA" "CT" "CT" "ID" "VA"
Somebody tried to edit this and put in a call to dput(state.abb) and then pasted that into a new assignment. Given that state is always available, that is completely unnecessary, hence my rejection. The only value I can see might be in suggesting that people actually look at the help page and in illustrating what state.abb looks like:
?state
dput(state.abb)
#c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
... snipped the rest.
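For comparison, here is a minimal sketch of a different technique from the sub() call above: pulling the match out directly with regexpr() and regmatches(), which does not need any characters before or after the abbreviation. It assumes every element contains at least one state abbreviation, because regmatches() silently drops elements with no match, and it returns the first abbreviation found in each element. The name abbs is only illustrative.

```r
# Example input from the question
strings <- c("0001AZ226", "0001CA243", "0NA01CT134", "0001CT1NA", "0001ID112", "NAVA230")

# Alternation of the built-in state abbreviations: "AL|AK|AZ|..."
sts <- paste(state.abb, collapse = "|")

# regexpr() locates the first match in each element; regmatches() extracts it
abbs <- regmatches(strings, regexpr(sts, strings))
abbs
# [1] "AZ" "CA" "CT" "CT" "ID" "VA"
```

Because "NA" is not in state.abb, it never matches, so no special handling is needed for it here.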
Upvotes: 1
Reputation: 16090
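This works provided each state abbreviation is followed by exactly three characters. The leading ^.* also discards everything before the abbreviation, so only the two letters remain:
strings.stripped <- gsub("^.*([A-Z]{2}).{3}$", "\\1", strings)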
Upvotes: 1
Reputation:
This can be done using lookarounds.
# (?i)(?:(?!na|(?<=n)(?=a))[a-z])+
(?i) # Case insensitive modifier (or use as regex flag)
(?: # Cluster group
(?! # Negative assertion
na # Not NA ahead
| (?<= n ) # Not N behind,
(?= a ) # and A ahead (at this location)
) # End Negative assertion
[a-z] # Safe, grab this single character
)+ # End Cluster group, do 1 to many times
Matches only these "AZ" "CA" "CT" "CT" "ID" "VA"
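In R, a pattern like this needs perl = TRUE, because the default regex engine does not support lookarounds. A minimal sketch of how it might be applied, assuming each element contains exactly one run of state letters (the names pat and res are illustrative, not part of the pattern above):

```r
# Example input from the question
strings <- c("0001AZ226", "0001CA243", "0NA01CT134", "0001CT1NA", "0001ID112", "NAVA230")

# The lookaround pattern, written on one line without comments
pat <- "(?i)(?:(?!na|(?<=n)(?=a))[a-z])+"

# gregexpr() finds every match per element; perl = TRUE enables PCRE lookarounds
res <- unlist(regmatches(strings, gregexpr(pat, strings, perl = TRUE)))
res
# [1] "AZ" "CA" "CT" "CT" "ID" "VA"
```

If an element could contain more than one run of letters, dropping the unlist() call keeps the matches grouped per element as a list.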
Upvotes: 1