PatraoPedro

Reputation: 197

Extracting part of string using regular expressions

I’m struggling to get a regular expression to work. I have a long list of strings from which I need to extract a part. I only want strings that start with “WER”, and from those I only need the last part of the string, starting at (and including) the last letter.

test <- c("abc00012Z345678","WER0004H987654","WER12400G789456","WERF12","0-0Y123")

Here is the line of code I have so far. It works, but only for one hard-coded letter (H), whereas the strings in my list can contain any letter.

ifelse(substr(test,1,3)=="WER",gsub("^.*H.*?","H",test),"")

What I’m hoping to achieve is the following:

H987654
G789456
F12

Upvotes: 2

Views: 1318

Answers (2)

Wiktor Stribiżew

Reputation: 626835

You can use the following pattern with gsub:

> gsub("^(?:WER.*([a-zA-Z]\\d*)|.*)$", "\\1", test)
[1] ""        "H987654" "G789456" "F12"     "" 

See the regex demo

This pattern matches:

  • ^ - start of a string
  • (?: - start of an alternation group with 2 alternatives:
    • WER.*([a-zA-Z]\\d*) - the WER char sequence followed by 0+ of any character (.*), as many as possible, up to the last letter ([a-zA-Z]), which is followed by 0+ digits (\\d*). The letter and digits are captured in group 1. Use \\d+ instead to require at least 1 digit.
    • | - or
    • .* - any 0+ characters
  • )$ - close the alternation group and match the end of the string with $.
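
The capturing part of the pattern can also be applied without the alternation, using base R's regexec() and regmatches(). This is only a sketch and not part of the original answer: it writes [0-9] in place of \\d, and the names m and res are illustrative.

```r
test <- c("abc00012Z345678", "WER0004H987654", "WER12400G789456", "WERF12", "0-0Y123")

# regexec() reports the full match plus each capture group per string;
# regmatches() returns character(0) for strings that do not match.
m <- regmatches(test, regexec("^WER.*([a-zA-Z][0-9]*)$", test))

# Keep capture group 1 (element 2), or "" when the string did not match.
res <- vapply(m, function(x) if (length(x) >= 2) x[2] else "", character(1))
res
```

Because non-matching strings yield an empty result here, no fallback alternative (|.*) is needed in the pattern itself.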

With str_match from stringr, it is even tidier:

> library(stringr)
> res <- str_match(test, "^WER.*([a-zA-Z]\\d*)$")
> res[,2]
[1] NA        "H987654" "G789456" "F12"     NA       

See another regex demo

If there are newlines in the input, add (?s) at the beginning of the pattern: res <- str_match(test, "(?s)^WER.*([a-zA-Z]\\d*)$").
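
The effect of (?s) can be checked without stringr. The sketch below uses base R with perl = TRUE (PCRE) as a stand-in for stringr's ICU engine, on the assumption that both treat . as not matching a newline by default. The input x is a made-up example, not taken from the question.

```r
# Hypothetical input with a newline inside the string.
x <- "WER00\n04H987654"

pat <- "^WER.*([a-zA-Z]\\d*)$"

# Without (?s), "." stops at the newline, so the pattern cannot match.
no_dotall <- grepl(pat, x, perl = TRUE)

# With (?s), "." also matches the newline and the tail can be extracted.
with_dotall <- sub(paste0("(?s)", pat), "\\1", x, perl = TRUE)

no_dotall
with_dotall
```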

Upvotes: 5

talat

Reputation: 70266

If you don't want empty strings or NA for strings that don't start with "WER", you could try the following approach:

sub(".*([A-Z].*)$", "\\1", test[grepl("^WER", test)])
#[1] "H987654" "G789456" "F12" 
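
One caveat with this pattern: for a string such as "WER123", which has no letter after the prefix, the last uppercase letter is the R of "WER", so the result would be "R123". A possible variation, offered only as a sketch, drops the prefix first and skips strings that have no letter left. The extra element "WER123" and the names wer, rest and out are illustrative.

```r
test <- c("abc00012Z345678", "WER0004H987654", "WER12400G789456", "WERF12", "0-0Y123", "WER123")

# Keep only the strings that start with the prefix, then cut the prefix off.
wer  <- test[startsWith(test, "WER")]
rest <- substring(wer, nchar("WER") + 1)

# Drop strings with no letter after the prefix, then keep the part
# from the last letter onwards.
rest <- rest[grepl("[A-Za-z]", rest)]
out  <- sub(".*([A-Za-z].*)$", "\\1", rest)
out
```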

Upvotes: 3
