Centro

Reputation: 4022

Group not working in regex

I wrote a regex to extract the id values of HTML elements:

<.+ id\s*=\s*["'](.+)["'].*/?>

For most cases it returns id values, but not for this one:

<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />

It matches the following group value:

__EVENTTARGET" value="

instead of the expected __EVENTTARGET.

What is wrong with the regex?

Upvotes: 3

Views: 122

Answers (3)

Ed Heal

Reputation: 60037

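The '+' is greedy! Once the pattern has matched the 'id', the = and the opening ", the (.+) wants more to eat. It gorges itself all the way to the end of the line, then backs off only as far as the last " so the rest of the pattern can match. That is why the group ends at the final quote instead of the first one.

Is that pizza ready yet, dear?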

Upvotes: 1

Scott Rippey

Reputation: 15810

Your expression (.+) is "greedy" -- it matches as much as it can, so it runs on to the last quote on the line rather than stopping at the first.

There are 2 solutions:

"Lazy" (non-greedy): this will match as few characters as possible

(.+?)

Or, better, match any character that is not a quote, [^'"], instead of matching any character with . :

([^'"]+)
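To compare the three variants side by side, here is a minimal sketch in Python (the question does not say which regex engine is in use, so Python's re module is an assumption, and first_id is a hypothetical helper name). It runs the original greedy pattern, the lazy version, and the negated-class version against the input tag from the question.

```python
import re

# The tag from the question.
TAG = '<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />'

# Original pattern: (.+) is greedy and runs to the last quote on the line.
GREEDY = r'''<.+ id\s*=\s*["'](.+)["'].*/?>'''
# Lazy quantifier: (.+?) stops at the first quote that lets the rest match.
LAZY = r'''<.+ id\s*=\s*["'](.+?)["'].*/?>'''
# Negated character class: the group can never contain a quote at all.
NEGATED = r'''<.+ id\s*=\s*["']([^'"]+)["'].*/?>'''


def first_id(pattern, html):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, html)
    return match.group(1) if match else None


for label, pattern in (("greedy", GREEDY), ("lazy", LAZY), ("negated", NEGATED)):
    print(label, "->", first_id(pattern, TAG))
```

The negated class is the sturdier of the two fixes because the group cannot cross a quote under any amount of backtracking, whereas the lazy version only prefers the shortest match.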

Upvotes: 2

Mark Byers

Reputation: 839074

Regular expressions aren't the best tool for parsing HTML.

You could try making it non-greedy:

<.+ id\s*=\s*["'](.+?)["'].*/?>
                    ^

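However, it can still fail on other inputs. For example, the leading <.+ is also greedy, so when several tags share a line it skips ahead to the last id on that line. It would be better to use an HTML parser, such as HTML Agility Pack.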
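As an illustration of the parser approach, here is a rough sketch that uses Python's standard-library html.parser in place of HTML Agility Pack (which is a .NET library, so this is a substitution for demonstration, not the same API). IdCollector and collect_ids are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects the value of every id attribute seen on a start tag."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        # Self-closing tags such as <input ... /> also reach this method,
        # because the default handle_startendtag delegates to it.
        for name, value in attrs:
            if name == "id" and value is not None:
                self.ids.append(value)


def collect_ids(html):
    """Return all id attribute values in document order."""
    parser = IdCollector()
    parser.feed(html)
    parser.close()
    return parser.ids


print(collect_ids('<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />'))
print(collect_ids("<a id=\"x\"><b id='y'>"))
```

Because the parser tokenizes attributes itself, quote style, attribute order, and multiple tags on one line stop mattering.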

Upvotes: 3
