Reputation: 5441
Why is this regular expression so lazy? It is supposed to back reference a height/width property, stuff in between (optional), and then another height/width property (optional). It only gets the first property and then quits even when it could match more.
((?:height|width)=["']\d*["'])([\s\w:;'"=])*?((?:height|width)=["']\d*["'])?
Upvotes: 1
Views: 109
Reputation: 15978
The easiest way to see what's going on is to break it out into extended format. In extended format, your regex...
((?:height|width)=["']\d*["'])([\s\w:;'"=])*?((?:height|width)=["']\d*["'])?
then becomes (with comments, which are legal in extended format):
( # a group that captures...
(?:height|width) # Height or width
= # The Equals sign
["'] # a double quote or quote
\d* # zero or more digits 0-9
["'] # a double quote or quote
) # required
( # zero or more groups that capture...space chars,
[\s\w:;'"=] # letters, numbers, colon, semicolon, quote, double quote, and equals
)*? # zero or more times, lazily (giving up as much as it can)
( # a group that...
(?:height|width) # height or width
= # The equals sign
["'] # double quote or quote
\d* # zero or more numbers
["'] # double quote or quote
)? # optionally
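To watch this happen, here is a minimal Python sketch. It assumes Python's `re` module, with `re.VERBOSE` standing in for extended format, and the sample string is made up for illustration. The lazy middle group starts at zero repetitions, and because the last group is optional the match is allowed to end right after the first property:

```python
import re

# The original pattern, written in extended (verbose) form.
ORIGINAL = re.compile(r"""
    ((?:height|width)=["']\d*["'])     # group 1: first property, required
    ([\s\w:;'"=])*?                    # group 2: repeated capturing group, lazy
    ((?:height|width)=["']\d*["'])?    # group 3: second property, optional
""", re.VERBOSE)

# Hypothetical sample input, not taken from the question.
sample = 'height="100" style="x" width="200"'

m = ORIGINAL.search(sample)
print(m.group(0))   # the text of the whole match
print(m.groups())   # contents of groups 1 to 3 (None = did not participate)
```

In Python this stops after `height="100"` and leaves groups 2 and 3 unset. Other engines may report the repeated group differently.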
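So your regex may generate as few as 1 group or as many as N groups, depending on the regex engine you're using: most engines keep only the last repetition of a repeated capturing group, while some (.NET, for example) expose every repetition. Your final group will be the group you want, if it's there. As for why it quits early: the lazy middle group starts out matching nothing, and because the last group is optional, the regex can succeed right there with the last group empty, so it never looks any further. Remove the lazy modifier of the second group (the ? after the *) and make the second group non-capturing, like so: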
( # a group that captures...
(?:height|width) # Height or width (non capturing)
= # The Equals sign
["'] # a double quote or quote
\d* # zero or more digits 0-9
["'] # a double quote or quote
) # required
(?: # zero or more groups of space chars, letters,
[\s\w:;'"=] # numbers, colon, semicolon, quote, double quote, and equals
)* # zero or more times as much as it can UNTIL...
( # a group that captures...
(?:height|width) # height or width (non-capturing)
= # The equals sign
["'] # double quote or quote
\d* # zero or more numbers
["'] # double quote or quote
)? # optional
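and now the first and last properties will be in groups 1 and 2, respectively, with the stuff in the middle ignored. If there is that last one, it should be captured. Do test it against your input, though: the middle character class also matches every character of a height/width property, so a greedy middle can swallow the last property and leave the optional last group empty.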
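A quick way to check what actually lands in each group is to run both shapes side by side. This is a sketch assuming Python's `re` module and a made-up sample string. The `wrapped` pattern is an assumed alternative, not part of the regex above: it keeps the middle lazy and puts it, together with a required last property, inside a single optional non-capturing group.

```python
import re

# Hypothetical sample input, not taken from the question.
sample = 'height="100" style="x" width="200"'

# Building blocks copied from the regex above.
PROP = r"""(?:height|width)=["']\d*["']"""
MIDDLE = r"""[\s\w:;'"=]"""

# Greedy middle followed by an optional last group (the shape shown above).
greedy = re.compile(rf"({PROP})(?:{MIDDLE})*({PROP})?")

# Assumed alternative: lazy middle plus a required last property,
# both wrapped in one optional non-capturing group.
wrapped = re.compile(rf"({PROP})(?:(?:{MIDDLE})*?({PROP}))?")

greedy_groups = greedy.search(sample).groups()
wrapped_groups = wrapped.search(sample).groups()
print(greedy_groups)
print(wrapped_groups)
```

With Python's engine, the greedy version leaves group 2 empty for this sample because the middle consumes `width="200"` as well, while the wrapped version fills both groups and still matches when only one property is present. Your engine may differ, so treat this as something to verify rather than a guarantee.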
Note: It might not be capturing the last part because a character that appears in the middle isn't covered by the middle group. If there's, say, a comma, a # or any other kind of punctuation character, it's not covered by that middle group's character class. You could consider replacing that middle one with:
["'] # a double quote or quote
) # required
.* # Anything, zero or more times, UNTIL...
( # a group that...
(?:height|width) # height or width (non-capturing)
and see if that DOES match. If it does, you may need to further enhance your middle group's character class.
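If you'd rather find the offending characters directly, a small helper can list everything in the in-between text that the middle class rejects. This is only a sketch: it assumes Python's `re` module, the helper name `unmatched_chars` is made up, and the sample text is hypothetical.

```python
import re

# The middle group's character class, copied from the regex above.
MIDDLE_CLASS = re.compile(r"""[\s\w:;'"=]""")

def unmatched_chars(text):
    """Return the distinct characters in text that the middle class rejects."""
    return sorted({ch for ch in text if not MIDDLE_CLASS.fullmatch(ch)})

# Hypothetical text sitting between two properties.
between = ' style="color: #fff, bold" '
missing = unmatched_chars(between)
print(missing)
```

Anything it reports is a candidate to add to the character class.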
If you don't care about how many matches occur in the middle group, just that you capture it, use a non-capturing group to match each character, and then wrap that in a capturing group to capture the entire run of intermediate text:
["'] # a double quote or quote
) # required
( # a group that captures...
(?: # zero or more groups of space chars, letters,
[\s\w:;'"=] # numbers, colon, semicolon, quote, double quote, and equals
)* # zero or more times as much as it can
) # UNTIL...
( # a group that captures...
(?:height|width) # height or width (non-capturing)
Now you will have a fixed number of captures, with the first part always in group 1, the middle stuff always in group 2, and the last (if it's there) in group 3.
Upvotes: 6