Tyler Rinker

Reputation: 109844

split on last occurrence of digit, take 2nd part

If I have a string and want to split on the last digit and keep the last part of the split, how can I do that?

x <- c("ID", paste0("X", 1:10, state.name[1:10]))

I'd like

 [1] NA            "Alabama"     "Alaska"      "Arizona"     "Arkansas"   
 [6] "California"  "Colorado"    "Connecticut" "Delaware"    "Florida"    
[11] "Georgia"    

But would settle for:

 [1] "ID"          "Alabama"     "Alaska"      "Arizona"     "Arkansas"   
 [6] "California"  "Colorado"    "Connecticut" "Delaware"    "Florida"    
[11] "Georgia"    

I can get the first part by:

unlist(strsplit(x, "[^0-9]*$"))

But want the second part.

Thank you in advance.

Upvotes: 6

Views: 639

Answers (4)

G. Grothendieck

Reputation: 269481

gsubfn

Try this gsubfn solution:

> library(gsubfn)
> strapply(x, ".*\\d(\\w*)|$", ~ if (nchar(z)) z else NA, simplify = TRUE)
 [1] NA            "Alabama"     "Alaska"      "Arizona"     "Arkansas"   
 [6] "California"  "Colorado"    "Connecticut" "Delaware"    "Florida"    
[11] "Georgia"    

The first alternative matches everything up to the last digit, followed by word characters, and captures those word characters. If that fails, the second alternative matches the end of the string, which ensures that something always matches. If the first alternative succeeded, the captured text is returned; otherwise the back reference is empty, so NA is returned.

Note that the formula is a shorthand way of writing the function function(z) if (nchar(z)) z else NA, and that function could alternatively replace the formula at the expense of slightly more keystrokes.

gsub

A similar strategy could also work using just straight gsub but requires two lines and a marginally more complex regular expression. Here we use the second alternative to slurp up non-matches from the first alternative:

> s <- gsub(".*\\d(\\w*)|.*", "\\1", x)
> ifelse(nchar(s), s, NA)
 [1] NA            "Alabama"     "Alaska"      "Arizona"     "Arkansas"   
 [6] "California"  "Colorado"    "Connecticut" "Delaware"    "Florida"    
[11] "Georgia"    
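
To make the role of the two alternatives concrete, here is a minimal sketch on a two-element input. The vector `demo` is made up for illustration, and `nzchar()` is used in place of `nchar()` purely as a matter of taste:

```r
# Hypothetical input, chosen only to exercise both alternatives.
demo <- c("ID", "X10Florida")

# "ID" has no digit, so only the second alternative (.*) can match it;
# group 1 is then empty and "\\1" expands to "".
# "X10Florida" matches the first alternative: the greedy .* runs up to
# the last digit, leaving the trailing word characters in group 1.
s <- gsub(".*\\d(\\w*)|.*", "\\1", demo)

# nzchar() is TRUE for non-empty strings, so empty results become NA.
result <- ifelse(nzchar(s), s, NA)
print(result)
```

The second alternative is what stops a string with no digit from passing through gsub unchanged.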

EDIT: minor improvements

Upvotes: 2

Andrie

Reputation: 179408

You can do this in one easy step with a regular expression:

gsub("(^.*\\d+)(\\w*)", "\\2", x)

Results in:

 [1] "ID"          "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California"  "Colorado"    "Connecticut"
 [9] "Delaware"    "Florida"     "Georgia"  

What the regex does:

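  1. "(^.*\\d+)(\\w*)": Look for two groups of characters.
    • The first group (^.*\\d+) matches any characters from the start of the string, followed by at least one digit. Because .* is greedy, this group extends through the last digit in the string.
    • The second group (\\w*) matches zero or more word characters (letters, digits, or underscores) after that digit.
  2. The "\\2" as the second argument to gsub() means to replace the matched text with the second group that the regex found. A string with no digit, such as "ID", does not match and is returned unchanged.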
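
If NA is preferred for elements that contain no digit, one possible extension is to flag them separately with grepl(). This is only a sketch; the name `tail_part` is made up for illustration:

```r
x <- c("ID", paste0("X", 1:10, state.name[1:10]))

# Same pattern and replacement as in the answer above.
tail_part <- gsub("(^.*\\d+)(\\w*)", "\\2", x)

# Elements without any digit never matched the pattern and came back
# unchanged, so mark them as missing instead.
tail_part[!grepl("\\d", x)] <- NA
print(tail_part)
```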

Upvotes: 4

thelatemail

Reputation: 93813

This seems a bit clunky, but it works:

state.pt2 <- unlist(strsplit(x,"^.[0-9]+"))
state.pt2[state.pt2!=""]

It would be nice to remove the ""'s generated by the match at the start of the string but I can't figure that out.

Here's another method, using substr and gregexpr, that avoids having to subset the results:

substr(x,unlist(lapply(gregexpr("[0-9]",x),max))+1,nchar(x))
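
Broken into steps, the one-liner can be read roughly as follows. The intermediate names (`last_digit`, `out`, `out_na`) are made up for illustration, and the final NA step is an optional addition rather than part of the original approach:

```r
x <- c("ID", paste0("X", 1:10, state.name[1:10]))

# Position of the last digit in each string. gregexpr() reports -1 when
# there is no match, so max() is -1 for "ID".
last_digit <- vapply(gregexpr("[0-9]", x), max, integer(1))

# Take everything after the last digit. For "ID" the start position is 0,
# which substr() treats as the start of the string, so "ID" is returned whole.
out <- substr(x, last_digit + 1, nchar(x))

# Optional: report NA when the string has no digit at all.
out_na <- ifelse(last_digit > 0, out, NA)
print(out_na)
```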

Upvotes: 2

mnel

Reputation: 115382

library(stringr)
unlist(lapply(str_split(x, "[0-9]"), tail,n=1))

gives

[1] "ID"          "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California"  "Colorado"    "Connecticut" "Delaware"   
[10] "Florida"     "Georgia"

I would look at the stringr documentation for (quite possibly) an even better approach.
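
For comparison, the same split-and-take-the-last-piece idea can be sketched in base R. This swaps strsplit in for str_split and is not part of stringr; the names `pieces` and `last_piece` are made up for illustration:

```r
x <- c("ID", paste0("X", 1:10, state.name[1:10]))

# Split each string on every single digit.
pieces <- strsplit(x, "[0-9]")

# Keep only the last piece of each split.
last_piece <- vapply(pieces, utils::tail, character(1), n = 1)
print(last_piece)
```

One difference to keep in mind: strsplit drops a trailing empty piece, so a string that ends in a digit would return the text before that digit rather than "".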

Upvotes: 2
