elarry

Reputation: 531

Using regex to selectively extract substrings in R

Suppose I have the following strings:

string <- c(
  "DATE_OF_BIRTH_B1",
  "HEIGHT_BABY2",
  "WEIGHT_BABY_3",
  "OTHER_CONDITION_4",
  "OTHER_OPERATION_5"
)

How can I use regex in gsub() to extract the part before the trailing number from the first three strings, while excluding the last two?

In other words, my expected gsub() output is:

"DATE_OF_BIRTH_B", "HEIGHT_BABY", "WEIGHT_BABY"

I managed to use gsub("(.+_B[A-Z]*)_?[0-9]", "\\1", string) to extract the desired substrings from the first three strings, but it failed to exclude the last two strings.

Could anyone help to correct and improve my regex, with a bit of explanation? Many thanks!

Upvotes: 2

Views: 90

Answers (2)

Wiktor Stribiżew

Reputation: 626699

If you expect gsub to return either the result of the replacement or an empty string, you need to follow the technique below. (In this case you should really use sub rather than gsub, since you only expect a single replacement per string.)

sub("...(<what_you_want_to_extract>)...|.+", "\\1", x)

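That is, your regex goes before the | alternation operator, which is followed by .+, a pattern that matches one or more characters, as many as possible. When your regex fails to match, .+ consumes the whole string, group 1 stays empty, and the \\1 replacement yields an empty string.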

So, in your case, assuming your regex is just what you need and meets all your requirements, you can use

> res <- sub("(.+_B[A-Z]*)_?[0-9]|.+", "\\1", string)
> res
[1] "DATE_OF_BIRTH_B" "HEIGHT_BABY"     "WEIGHT_BABY"     ""                ""      
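
To see what the |.+ fallback contributes, here is a minimal sketch that runs the question's pattern with and without it on the same string vector. The variable names no_fallback and with_fallback are made up for this illustration.

```r
string <- c(
  "DATE_OF_BIRTH_B1",
  "HEIGHT_BABY2",
  "WEIGHT_BABY_3",
  "OTHER_CONDITION_4",
  "OTHER_OPERATION_5"
)

# Without the fallback, strings that do not match are returned unchanged
no_fallback <- sub("(.+_B[A-Z]*)_?[0-9]", "\\1", string)

# With the fallback, .+ matches those strings and the empty group 1 replaces them
with_fallback <- sub("(.+_B[A-Z]*)_?[0-9]|.+", "\\1", string)

print(no_fallback)
print(with_fallback)
```

The last two elements of no_fallback should still read "OTHER_CONDITION_4" and "OTHER_OPERATION_5", while in with_fallback they become empty strings.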

If you need to remove empty items, just use

> res[nzchar(res)]
[1] "DATE_OF_BIRTH_B" "HEIGHT_BABY"     "WEIGHT_BABY"
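
If the empty placeholders are never needed, another option (a sketch that is not part of the technique above) is to keep only the strings that match before replacing. It assumes the question's pattern is the right filter; pattern and matched are illustrative names.

```r
string <- c(
  "DATE_OF_BIRTH_B1",
  "HEIGHT_BABY2",
  "WEIGHT_BABY_3",
  "OTHER_CONDITION_4",
  "OTHER_OPERATION_5"
)

# The question's original pattern, without the |.+ fallback
pattern <- "(.+_B[A-Z]*)_?[0-9]"

# Keep only the strings the pattern matches, then extract group 1 from them
matched <- string[grepl(pattern, string)]
res <- sub(pattern, "\\1", matched)

print(res)
```

This avoids the nzchar() step, at the cost of running the regex twice.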

Upvotes: 1

Paul

Reputation: 9087

Remove either the whole string, if it starts with OTHER, or just the trailing numeric suffix.

gsub("^OTHER.*|_?[0-9]+$", "", string)
#> [1] "DATE_OF_BIRTH_B"
#> [2] "HEIGHT_BABY"    
#> [3] "WEIGHT_BABY"    
#> [4] ""               
#> [5] ""  
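
As a rough illustration of how the two alternatives divide the work, the sketch below applies each half of the pattern on its own and then the combined pattern. The names drop_other, strip_suffix and combined are made up for this example.

```r
string <- c(
  "DATE_OF_BIRTH_B1",
  "HEIGHT_BABY2",
  "WEIGHT_BABY_3",
  "OTHER_CONDITION_4",
  "OTHER_OPERATION_5"
)

# First alternative alone: blanks out every string that starts with OTHER
drop_other <- gsub("^OTHER.*", "", string)

# Second alternative alone: strips an optional underscore plus trailing digits
strip_suffix <- gsub("_?[0-9]+$", "", string)

# Both alternatives together, as in the pattern above
combined <- gsub("^OTHER.*|_?[0-9]+$", "", string)

print(drop_other)
print(strip_suffix)
print(combined)
```

On its own, the suffix alternative would leave "OTHER_CONDITION" and "OTHER_OPERATION" behind, which is why the ^OTHER.* alternative is needed to blank those strings entirely.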

Or, if you specifically want capture groups, use a non-greedy capture.

gsub("(OTHER.*)?(.*?)_?[0-9]", "\\2", string)
#> [1] "DATE_OF_BIRTH_B"
#> [2] "HEIGHT_BABY"    
#> [3] "WEIGHT_BABY"    
#> [4] ""               
#> [5] "" 

Upvotes: 3
