Reputation: 531
Suppose I have the following strings:
string <- c(
"DATE_OF_BIRTH_B1",
"HEIGHT_BABY2",
"WEIGHT_BABY_3",
"OTHER_CONDITION_4",
"OTHER_OPERATION_5"
)
How can I use a regex in gsub() to extract part of these strings? In other words, my expected gsub() output is:
"DATE_OF_BIRTH_B", "HEIGHT_BABY", "WEIGHT_BABY"
I managed to use gsub("(.+_B[A-Z]*)_?[0-9]", "\\1", string) to extract the desired substrings from the first three strings, but it failed to exclude the last two strings.
Could anyone help me correct and improve my regex, with a bit of explanation? Many thanks!
Upvotes: 2
Views: 90
Reputation: 626699
If you expect gsub (or sub; in this case you really should use sub, since you only expect a single replacement operation) to return either the result of the replacement or an empty string, you need to follow this technique:
sub("...(<what_you_want_to_extract>)...|.+", "\\1", x)
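That is, your regex goes before the | alternation operator, which is followed by .+, a pattern that matches one or more characters, as many as possible. When your regex does not match, .+ consumes the whole string and group 1 stays empty, so \\1 replaces the string with an empty string.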
So, in your case, assuming your regex is just what you need and meets all your requirements, you can use
> res <- sub("(.+_B[A-Z]*)_?[0-9]|.+", "\\1", string)
> res
[1] "DATE_OF_BIRTH_B" "HEIGHT_BABY" "WEIGHT_BABY" "" ""
If you need to remove empty items, just use
> res[nzchar(res)]
[1] "DATE_OF_BIRTH_B" "HEIGHT_BABY" "WEIGHT_BABY"
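The technique can also be wrapped in a small helper. This is only a sketch: extract_or_empty is a hypothetical name, not something from base R, and it assumes the pattern keeps the wanted text in capture group 1 and has no top-level alternation of its own.

```r
# Hypothetical helper: appends the "|.+" fallback to a pattern whose
# first capture group holds the text to keep.
extract_or_empty <- function(pattern, x) {
  sub(paste0(pattern, "|.+"), "\\1", x)
}

string <- c(
  "DATE_OF_BIRTH_B1",
  "HEIGHT_BABY2",
  "WEIGHT_BABY_3",
  "OTHER_CONDITION_4",
  "OTHER_OPERATION_5"
)

res <- extract_or_empty("(.+_B[A-Z]*)_?[0-9]", string)
kept <- res[nzchar(res)]  # drop the empty strings left by non-matching items
print(kept)
```

Keeping the fallback in one place means the caller only has to write the part of the regex that describes what to extract.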
Upvotes: 1
Reputation: 9087
Remove either the whole string, if it starts with OTHER, or just the numeric suffix.
gsub("^OTHER.*|_?[0-9]+$", "", string)
#> [1] "DATE_OF_BIRTH_B"
#> [2] "HEIGHT_BABY"
#> [3] "WEIGHT_BABY"
#> [4] ""
#> [5] ""
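To see what each branch of that pattern does, the same result can be built in two separate steps. This is an illustrative decomposition, not the answer's method; the variable names starts_with_other and res are made up here.

```r
string <- c(
  "DATE_OF_BIRTH_B1",
  "HEIGHT_BABY2",
  "WEIGHT_BABY_3",
  "OTHER_CONDITION_4",
  "OTHER_OPERATION_5"
)

# Branch 1 of the pattern: strings that start with OTHER are blanked entirely.
starts_with_other <- grepl("^OTHER", string)

# Branch 2 of the pattern: strip an optional underscore plus trailing digits.
res <- sub("_?[0-9]+$", "", string)
res[starts_with_other] <- ""

print(res)
```

The single gsub() call is shorter, but splitting it up can make it easier to adjust one rule without touching the other.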
Or, if you specifically want capture groups, use a non-greedy capture.
gsub("(OTHER.*)?(.*?)_?[0-9]", "\\2", string)
#> [1] "DATE_OF_BIRTH_B"
#> [2] "HEIGHT_BABY"
#> [3] "WEIGHT_BABY"
#> [4] ""
#> [5] ""
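The non-greedy quantifier matters because of the optional underscore. A minimal comparison on one string follows; it assumes perl = TRUE so that the matching rules are those of PCRE, which the original answer does not use.

```r
x <- "WEIGHT_BABY_3"

# Greedy: (.*) grabs as much as it can, including the underscore,
# so the optional _? ends up matching nothing.
greedy <- sub("(.*)_?[0-9]", "\\1", x, perl = TRUE)

# Non-greedy: (.*?) stops as early as possible, leaving the underscore to _?.
lazy <- sub("(.*?)_?[0-9]", "\\1", x, perl = TRUE)

print(greedy)  # "WEIGHT_BABY_"
print(lazy)    # "WEIGHT_BABY"
```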
Upvotes: 3