Robert Almgren

Reputation: 821

regular expression -- greedy matching?

I am trying to extract a leading string by stripping off an optional trailing string, where the trailing strings are a subset of possible leading strings but not vice versa. Suppose the leading string is like [a-z]+ and the trailing string is like c. Thus from "abc" I want to extract "ab", and from "ab" I also want to get "ab". Something like this:

^([a-z]+)(?:c|)

The problem is that [a-z]+ is greedy and matches the entire string, leaving the empty branch of the alternation to succeed, so the captured value is "abc" for "abc" and "ab" for "ab". (The (?: makes the second group non-capturing.) I want some way to make the regex prefer the longer, or the first, branch of the alternation, and have that choice determine what the first part matches.
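The behaviour described here can be reproduced in R with a short sketch. It assumes `perl = TRUE` and uses `sub()` with `\\1` to read back the captured group; neither choice is stated in the question, and the variable names are illustrative only.

```r
x <- c("abc", "ab")

# Greedy [a-z]+ grabs the whole string first; the empty branch of the
# alternation then lets the match succeed without giving back the "c".
greedy <- sub("^([a-z]+)(?:c|)", "\\1", x, perl = TRUE)
print(greedy)
```

Both inputs come back unchanged, which is the result the question is trying to avoid for "abc".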

I have also tried putting the desired target inside both of the alternatives:

^([a-z]+)c|^([a-z]+)

I think that it should prefer to match the first one of the two possible alternatives, but I get the same results as above.

I am doing this in R, so I can use either the POSIX or the Perl regex library.

(The actual problem involves futures trading symbols. These have a root "instrument name" like [A-Z0-9]+, followed by an "expiration code" like [FGHJKMNQUVXZ][0-9]{1,2}. Given a symbol like "ZNH3", I want to strip the "H3" to get "ZN". But if I give it "ZN" I also want to get back "ZN".)

Upvotes: 0

Views: 162

Answers (4)

Gavin Jackson

Reputation: 2055

Use a 'non-greedy' match for the first part of the regex, followed by the definitions of your 'optional allowed suffixes' anchored by the 'end-of-string'...

This regex, (.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$, matches:

  • (.+?) as few characters as possible
  • ([FGHJKMNQUVXZ][0-9]{1,2})? followed by an allowable (but optional) suffix
  • $ followed by the end of string

The required result is the first capture group of the match (however that is referenced in R) :-)
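In R, one way to read the first capture group is to replace the whole match with `\\1` via `sub()`. The following is a minimal sketch, assuming `perl = TRUE`; the sample symbols and variable names are made up for illustration.

```r
symbols <- c("ZNH3", "ZN", "CLZ14")

# Lazy root, optional expiration code, anchored at the end of the string.
pattern <- "(.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$"

# Replace the whole match with the first capture group (the root).
roots <- sub(pattern, "\\1", symbols, perl = TRUE)
print(roots)
```

The `$` anchor is what forces the lazy `(.+?)` to keep expanding until the rest of the string is either a valid suffix or empty.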

Upvotes: 0

BrodieG

Reputation: 52697

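A variation on the non-greedy answers using base R only. Note that this pattern requires the expiration code to be present, so a bare root such as "ZN" produces no match.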

codes <- c("ZNH3", "CLZ4")
matched <- regmatches(codes, regexec("^([A-Z0-9]+?)[FGHJKMNQUVXZ][0-9]{1,2}$", codes))
# [[1]]
# [1] "ZNH3" "ZN"  
# 
# [[2]]
# [1] "CLZ4" "CL"  
sapply(matched, `[[`, 2)  # extract just the roots
# [1] "ZN" "CL"  
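One way to also accept symbols without an expiration code is to wrap the suffix in an optional non-capturing group. This is a sketch rather than part of the original answer, and it assumes R 3.3.0 or later, where `regexec()` accepts `perl = TRUE`.

```r
codes <- c("ZNH3", "CLZ4", "ZN")

# (?:...)? makes the expiration code optional without adding a capture group.
pattern <- "^([A-Z0-9]+?)(?:[FGHJKMNQUVXZ][0-9]{1,2})?$"
matched <- regmatches(codes, regexec(pattern, codes, perl = TRUE))

# Element 1 is the full match, element 2 is the captured root.
roots <- vapply(matched, `[`, character(1), 2)
print(roots)
```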

Upvotes: 0

Sven Hohenstein

Reputation: 81743

Here's a working regular expression:

vec <- c("ZNH3", "ZN", "ZZZ33", "ABF")

sub("(\\w+)[FGHJKMNQUVXZ]\\d{1,2}", "\\1", vec)
# [1] "ZN"  "ZN"  "ZZ"  "ABF"

Upvotes: 1

G. Grothendieck

Reputation: 270298

Try this:

> library(gsubfn)
> strapplyc(c("abc", "abd"), "^(\\w+?)c?$", simplify = TRUE)
[1] "ab"  "abd"

and even easier:

> sub("c$", "", c("abc", "abd"))
[1] "ab"  "abd"
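The same end-anchored `sub()` idea carries over to the futures symbols from the question. A minimal sketch, with made-up sample symbols:

```r
symbols <- c("ZNH3", "ZN", "CLZ14")

# Delete an expiration code only when it sits at the very end of the string;
# symbols without one are returned unchanged.
roots <- sub("[FGHJKMNQUVXZ][0-9]{1,2}$", "", symbols)
print(roots)
```

Because nothing has to be captured, the greedy versus non-greedy question does not arise here.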

Upvotes: 1
