Reputation: 5441

Why is my Regular Expression so Lazy?

Why is this regular expression so lazy? It is supposed to capture a height/width property, the stuff in between (optional), and then another height/width property (optional). It only captures the first property and then quits, even when it could match more.

((?:height|width)=["']\d*["'])([\s\w:;'"=])*?((?:height|width)=["']\d*["'])?

sample code on regexpal

Upvotes: 1

Answers (1)

Robert P

Reputation: 15978

The easiest way to see what's going on is to break it out into extended format. In extended format, your regex...

((?:height|width)=["']\d*["'])([\s\w:;'"=])*?((?:height|width)=["']\d*["'])?

then becomes (with comments, which are legal in extended format):

(                     # a group that captures...
    (?:height|width)  # Height or width
    =                 # The Equals sign
    ["']              # a double quote or quote
    \d*               # zero or more digits 0-9
    ["']              # a double quote or quote
)                     # required
(                     # zero or more groups that capture... whitespace chars,
    [\s\w:;'"=]       # letters, numbers, colon, semicolon, quote, double quote, and equals
)*?                   # zero or more times, lazily (matching as little as it can)
(                     # a group that...
    (?:height|width)  # height or width
    =                 # the equals sign
    ["']              # double quote or quote
    \d*               # zero or more numbers
    ["']              # double quote or quote
)?                    # optionally
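
To watch the laziness happen, here is a minimal sketch in Python, whose re.VERBOSE flag accepts this extended format. The sample string is an assumption, since the regexpal sample is not reproduced here.

```python
import re

# The original pattern in extended format. re.VERBOSE ignores the layout
# whitespace and the comments (whitespace inside [...] would still count).
ORIGINAL = re.compile(r"""
    ( (?:height|width) = ["'] \d* ["'] )    # group 1: first property, required
    ( [\s\w:;'"=] )*?                       # group 2: middle, lazy
    ( (?:height|width) = ["'] \d* ["'] )?   # group 3: second property, optional
""", re.VERBOSE)

# Hypothetical input, not the asker's actual sample.
sample = 'width="100" height="200"'

m = ORIGINAL.search(sample)
print(m.group(0))   # width="100"
print(m.groups())   # ('width="100"', None, None)
```

The match ends right after the first property: the lazy middle takes zero characters, the optional last group fails on the space, and the engine accepts that as a successful match.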

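So the laziness comes from two things working together: the lazy *? on the middle group prefers to match nothing, and because the last group is optional, the engine can stop right after the first property and still report success. The middle group is also repeated, so it can capture anywhere from 1 to N times, and how those repeated captures are reported depends on the regex engine you're using. Your final group will be the group you want, if it's there. Remove the lazy modifier of the second group (the ?) and make the second group non-capturing, like so: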

(                     # a group that captures...
    (?:height|width)  # Height or width (non-capturing)
    =                 # The Equals sign
    ["']              # a double quote or quote
    \d*               # zero or more digits 0-9
    ["']              # a double quote or quote
)                     # required
(?:                   # zero or more groups of whitespace chars, letters,
    [\s\w:;'"=]       # numbers, colon, semicolon, quote, double quote, and equals
)*                    # zero or more times, as many as it can, UNTIL...
(                     # a group that captures...
    (?:height|width)  # height or width (non-capturing)
    =                 # the equals sign
    ["']              # double quote or quote
    \d*               # zero or more numbers
    ["']              # double quote or quote
)?                    # optional

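and now the first and last properties are meant to land in groups 1 and 2, respectively, with the stuff in the middle ignored. If there is that last one, it should be captured. One caveat: every character of a property such as height="200" is itself in the middle character class, so a greedy middle can swallow the second property, after which the optional last group matches nothing and the engine has no reason to backtrack. If group 2 comes back empty, keep the middle lazy and put it inside a single optional non-capturing group together with the last property, along the lines of (first)(?:middle*?(last))?, so that the last property is required whenever the optional part is taken.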
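
It is worth checking the group contents against your own engine and input. The following is a minimal sketch in Python, on a hypothetical sample string, comparing the greedy pattern above with a variant that nests a lazy middle and the last property inside one optional group. The names PROP, MIDDLE, greedy and nested are illustrative only, and the nested variant is an alternative sketch rather than the pattern given above.

```python
import re

PROP = r"""(?:height|width)=["']\d*["']"""   # one height/width property
MIDDLE = r"""[\s\w:;'"=]"""                  # the middle character class

# Greedy middle followed by an optional last group.
greedy = re.compile(f"({PROP})(?:{MIDDLE})*({PROP})?")

# Lazy middle and last property wrapped together in one optional group.
nested = re.compile(f"({PROP})(?:{MIDDLE}*?({PROP}))?")

sample = 'width="100" height="200"'   # hypothetical input

print(greedy.search(sample).groups())   # ('width="100"', None)
print(nested.search(sample).groups())   # ('width="100"', 'height="200"')
```

On this input the greedy middle consumes height="200" itself, so the optional group has nothing left to capture. In the nested form the middle grows one character at a time until the last property matches.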

Note: It might not be capturing the last part because there's a character in between that the middle group's character class doesn't allow. A comma, a #, or any other kind of mark character is not covered by that class. You could consider replacing that middle one with:

    ["']              # a double quote or quote
)                     # required
.*                    # Anything, zero or more times, UNTIL...
(                     # a group that...
    (?:height|width)  # height or width (non-capturing)

and see if that DOES match. If it does, you may need to further enhance your middle group's character class.
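
Rather than guessing which characters are missing, you can list them. This is a minimal sketch in Python; uncovered_chars is a hypothetical helper and the input string is made up for illustration.

```python
import re

MIDDLE_CLASS = r"""[\s\w:;'"=]"""   # the middle group's character class

def uncovered_chars(text):
    """Return the distinct characters in text that MIDDLE_CLASS does not match."""
    # Delete everything the class accepts; whatever remains is not covered.
    return sorted(set(re.sub(MIDDLE_CLASS, "", text)))

# Hypothetical stretch of text sitting between two properties.
between = ' style="color:#fff, bold" '
print(uncovered_chars(between))   # ['#', ',']
```

Any character it reports would stop the middle group before it reaches the second property, so those are the candidates to add to the class.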

If you don't care how many times the middle group matches, only that its text is captured, use a non-capturing group for each repetition and wrap the repetitions in one capturing group that collects the entire intermediate stretch:

    ["']              # a double quote or quote
)                     # required
(                     # a group that captures...
    (?:               # zero or more groups of whitespace chars, letters,
        [\s\w:;'"=]   # numbers, colon, semicolon, quote, double quote, and equals
    )*                # zero or more times as much as it can
)                     # UNTIL...
(                     # a group that captures...
    (?:height|width)  # height or width (non-capturing)

Now you will have a fixed number of captures, with the first part always in group 1, the middle stuff always in group 2, and the last (if it's there) in group 3.
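
A minimal Python sketch of the three-group layout, on a hypothetical sample string. The names fixed and lazy_nested are illustrative. The second pattern is an assumption of mine, not part of the pattern above: it keeps the middle lazy inside one optional group, which helps when the middle class can also match the last property.

```python
import re

PROP = r"""(?:height|width)=["']\d*["']"""   # one height/width property
MIDDLE = r"""[\s\w:;'"=]"""                  # the middle character class

# Three capturing groups: first property, whole middle, optional last property.
fixed = re.compile(f"({PROP})((?:{MIDDLE})*)({PROP})?")

# Variant: lazy middle plus last property inside one optional group.
lazy_nested = re.compile(f"({PROP})(?:({MIDDLE}*?)({PROP}))?")

sample = 'width="100" height="200"'   # hypothetical input

print(fixed.groups)                          # 3
print(fixed.search(sample).groups())         # ('width="100"', ' height="200"', None)
print(lazy_nested.search(sample).groups())   # ('width="100"', ' ', 'height="200"')
```

Both patterns always expose exactly three groups. On this input the greedy middle keeps the second property inside group 2, while the lazy variant hands it to group 3.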

Upvotes: 6
