Reputation: 4022
I wrote a regex for finding the id values of HTML elements:
<.+ id\s*=\s*["'](.+)["'].*/?>
For most cases it returns id values, but not for this one:
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
It matches the following group value:
__EVENTTARGET" value="
instead of the expected __EVENTTARGET.
What is wrong in the regex?
Upvotes: 3
Views: 122
Reputation: 60037
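The '+' is greedy! Once the pattern has matched the 'id', the = and the opening ", the (.+) wants more to eat. It gorges itself all the way to the end of the line, then grudgingly backs up to the final " (the one closing value="") so the rest of the pattern can match. That is why you get __EVENTTARGET" value=" instead of just the id.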
Is that Pizza ready yet dear!
Upvotes: 1
Reputation: 15810
Your expression (.+) is "greedy": it matches as much as it can.
There are 2 solutions:
"Lazy" (non-greedy): this will match as few characters as possible:
(.+?)
or a better solution: instead of matching . (any character), you should match [^'"] (any character that is not a quote):
([^'"]+)
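To see the difference side by side, here is a minimal sketch that drops each capture group into the pattern from the question and runs it on the question's input. It uses Python's re module purely for illustration (the question does not say which regex engine is in use); the dictionary keys and variable names are my own.

```python
import re

# The input from the question.
html = '<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />'

# The original pattern with each of the three capture groups discussed above.
patterns = {
    "greedy": r"""<.+ id\s*=\s*["'](.+)["'].*/?>""",
    "lazy": r"""<.+ id\s*=\s*["'](.+?)["'].*/?>""",
    "negated": r"""<.+ id\s*=\s*["']([^'"]+)["'].*/?>""",
}

# Capture group 1 is the id value each variant extracts.
results = {name: re.search(p, html).group(1) for name, p in patterns.items()}

for name, value in results.items():
    print(f"{name:8} {value!r}")
```

The greedy variant reproduces the unwanted `__EVENTTARGET" value="`, while the lazy and negated-class variants both return `__EVENTTARGET` for this input.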
Upvotes: 2
Reputation: 839074
Regular expressions aren't the best tool for parsing HTML.
You could try making it non-greedy by adding a ? after the .+ inside the capture group:
<.+ id\s*=\s*["'](.+?)["'].*/?>
However, it can still fail on other examples. It would be better to use an HTML parser, such as HTML Agility Pack.
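As a rough illustration of the parser approach: HTML Agility Pack is a .NET library, so this sketch swaps in Python's standard-library html.parser instead, only to show the idea. IdCollector and find_ids are hypothetical names of mine, not part of any library, and the sample markup beyond the question's input tag is made up.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Hypothetical helper: collects the id attribute of every tag it sees."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser has already
        # stripped the surrounding quotes, so no quote matching is needed.
        for name, value in attrs:
            if name == "id" and value is not None:
                self.ids.append(value)


def find_ids(html):
    parser = IdCollector()
    parser.feed(html)
    parser.close()
    return parser.ids


# Several tags on one line, mixed quote styles, and an id containing an
# apostrophe: input on which the regex variants tend to miss or truncate ids.
sample = (
    '<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />'
    "<div id='main'><span id=\"it's\">text</span></div>"
)
print(find_ids(sample))
```

Because the parser tokenizes attributes itself, quoting style and the number of tags per line stop mattering, which is the main reason to prefer it over a single pattern.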
Upvotes: 3