mcjudd

Reputation: 1580

Regex in R: How to extract certain letters from alphanumeric element in character vector?

Say, for instance, I have the following character vector of alphanumeric elements that include state abbreviations somewhere within the element:

strings <- c("0001AZ226", "0001CA243", "0NA01CT134", "0001CT1NA", "0001ID112", "NAVA230")

How can I extract the letters, excluding NA? I.e.,

somefunction(strings)
[1] "AZ"  "CA"  "CT"  "CT"  "ID"  "VA"

I've used regular expressions before to remove all non-digit characters from each element, but never to remove all digits plus just the letters N and A.

This is what I tried, but it didn't work:

 sub(paste(LETTERS[c(2:13,15:26)], collapse = "|"), "", strings, fixed = TRUE)

Upvotes: 1

Views: 1056

Answers (4)

Sven Hohenstein

Reputation: 81683

A simple solution:

gsub("\\d+|NA", "", strings)
# [1] "AZ" "CA" "CT" "CT" "ID" "VA"
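
For context, the attempt in the question most likely fails because `fixed = TRUE` makes `sub` treat the whole pattern as a literal string, so the `|` characters are not read as alternation. The following is a minimal sketch of that difference, reusing the `strings` vector from the question; the names `literal` and `states` are only illustrative.

```r
strings <- c("0001AZ226", "0001CA243", "0NA01CT134", "0001CT1NA", "0001ID112", "NAVA230")

# With fixed = TRUE the text "B|C|D|...|Z" is searched for literally,
# so nothing matches and the input comes back unchanged.
literal <- sub(paste(LETTERS[c(2:13, 15:26)], collapse = "|"), "", strings, fixed = TRUE)
identical(literal, strings)

# Without fixed = TRUE, "|" is alternation: drop runs of digits and the
# literal token "NA"; what remains is the state code.
states <- gsub("\\d+|NA", "", strings)
states
```

Note that this approach relies on the state code never being adjacent to a stray N or A in a way that forms "NA", which holds for the sample data.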

Upvotes: 2

IRTFM

Reputation: 263331

The state dataset is available by default. Look at:

 ?state

sts <- paste(state.abb,collapse="|")

sub(paste0( "(.+)(", sts, ")(.+)"), "\\2", strings)
[1] "AZ" "CA" "CT" "CT" "ID" "VA"

Somebody tried to edit this answer to add a call to dput(state.abb) and paste its output into a new assignment. Given that the state dataset is always available, that is completely unnecessary, hence my rejection. The only value I can see in it is in suggesting that people actually look at the help page, and in illustrating what state.abb looks like:

?state
dput(state.abb)
#c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", 
... snipped the rest.
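
As a variation on the `sub` call above, the same alternation can be used to pull the match out directly instead of replacing around it. This is only a sketch: it assumes every element contains at least one state abbreviation (`regmatches` silently drops elements with no match), and the name `found` is illustrative.

```r
strings <- c("0001AZ226", "0001CA243", "0NA01CT134", "0001CT1NA", "0001ID112", "NAVA230")

# Build one alternation from the built-in vector of state abbreviations.
sts <- paste(state.abb, collapse = "|")

# regexpr() locates the first abbreviation in each element and
# regmatches() extracts it. "NA" is not a state abbreviation, so it is skipped.
found <- regmatches(strings, regexpr(sts, strings))
found
```

Unlike the `(.+)` groups in the `sub` version, this does not require any characters before or after the abbreviation.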

Upvotes: 1

Hugh

Reputation: 16090

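This works provided each state abbreviation is followed by exactly three characters.

# Anchor at both ends so that everything before the abbreviation is dropped too.
strings.stripped <- sub("^.*([A-Z]{2}).{3}$", "\\1", strings)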

Upvotes: 1

user557597

Reputation:

This can be done using lookarounds.

 # (?i)(?:(?!na|(?<=n)(?=a))[a-z])+

 (?i)           # Case insensitive modifier (or use as regex flag)
 (?:            # Cluster group
      (?!            # Negative assertion
           na             # Not NA ahead
        |  (?<= n )       # Not N behind,
           (?= a )        # and A ahead (at this location) 
      )              # End Negative assertion
      [a-z]          # Safe, grab this single character
 )+             # End Cluster group, do 1 to many times

Matches only these "AZ" "CA" "CT" "CT" "ID" "VA"
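
The pattern above is written engine-neutrally. As a rough sketch of how it might be applied in R, lookarounds need `perl = TRUE`, and `gregexpr` with `regmatches` can collect the matches; the names `pattern` and `letters_found` are only illustrative.

```r
strings <- c("0001AZ226", "0001CA243", "0NA01CT134", "0001CT1NA", "0001ID112", "NAVA230")

# Compact form of the commented pattern; backslashes are not needed here.
pattern <- "(?i)(?:(?!na|(?<=n)(?=a))[a-z])+"

# perl = TRUE switches to PCRE, which supports lookahead and lookbehind.
# gregexpr() finds every run of "safe" letters in each element.
letters_found <- unlist(regmatches(strings, gregexpr(pattern, strings, perl = TRUE)))
letters_found
```

Each element of the sample data yields exactly one run of letters, so the unlisted result lines up with the input; with other data an element could contribute zero or several matches.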

Upvotes: 1
