Jonny

Reputation: 2927

REGEX Matching the whole HTML Document

I'm still a regex beginner and have only been using regular expressions for the past 2 days. However, my problem seems odd, to me at least.

The following pattern correctly matches this string for me:

<td valign=3D\"top\">For:</td>(\\s)+(=)?(.|\r\n|\n)+<td>(([a-z]|[A-Z]|=|\\s)+)<br>

Original string (taken from the HTML document that is fed to the regex as input):

<td valign=3D"top">For:</td>     =             <td>XXXXXX XXXXX<br>

and the matched string:

<td valign=3D"top">For:</td>     =             <td>XXXXXX XXXXX<br>

However for this string:

<td valign=3D"top">For:</td>                     <td>YYYYYYY=     YYYYY<br>

it matches the entire HTML document. I don't understand why this happens, since after my (([a-z]|[A-Z]|=|\\s)+) group I specified that there must be a <br> tag.

Upvotes: 0

Views: 446

Answers (2)

Filipe Palrinhas

Reputation: 1235

Parsing HTML with regexes is a very bad idea.

See why here: RegEx match open tags except XHTML self-contained tags

Even for extracting very simple things from HTML, a DOM parser is generally cleaner (more readable) and less error-prone, even more so if you are new to regexes.
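
As a rough illustration of the parser route, here is a minimal sketch using Python's standard-library html.parser. It is an event-based parser rather than a full DOM, chosen here only because it needs no third-party install. The names ForCellParser and extract_for_value are hypothetical, and the sketch assumes the wanted value is the text between the <td> that follows the "For:" cell and the next <br>.

```python
from html.parser import HTMLParser


class ForCellParser(HTMLParser):
    """Collect the text of the <td> that follows the 'For:' cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_td = False       # currently inside a <td>
        self.seen_label = False  # the 'For:' cell has been closed
        self.value = None        # extracted text, or None if not found
        self._buf = []           # text pieces of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self._buf = []
        elif tag == "br" and self.in_td and self.seen_label and self.value is None:
            # The first <br> inside the cell after 'For:' ends the value.
            self.value = "".join(self._buf).strip()

    def handle_endtag(self, tag):
        if tag == "td":
            if "".join(self._buf).strip() == "For:":
                self.seen_label = True
            self.in_td = False

    def handle_data(self, data):
        # Text between cells (such as a stray '=') is ignored.
        if self.in_td:
            self._buf.append(data)


def extract_for_value(html):
    """Return the text after the 'For:' cell, or None if it is absent."""
    parser = ForCellParser()
    parser.feed(html)
    parser.close()
    return parser.value
```

Under these assumptions, the length of the surrounding document no longer matters, because the parser stops caring once the first matching cell has been read.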

Upvotes: 1

Andrew Cheong

Reputation: 30293

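Your quantifiers are greedy: (.|\r\n|\n)+ first consumes everything up to the end of the document and then backtracks only as far as the last <td>...<br> it can find. Add the indicated question marks for non-greedy matching: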

<td valign=3D\"top\">For:</td>(\\s)+(=)?(.|\r\n|\n)+?<td>(([a-z]|[A-Z]|=|\\s)+?)<br>
                                                    ^                         ^

EDIT:

Further, you can simplify into a character class instead of using alternation:

<td valign=3D\"top\">For:</td>(\\s)+(=)?(.|[\r\n])+?<td>([a-zA-Z=\\s]+?)<br>
                                           ^^^^^^        ^^^^^^^^^^^^

My only question is why your \\s is escaped while your \r\n are not...
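
One possible explanation, sketched in Python on the assumption that the pattern lives in an ordinary string literal with C-style escapes: \\s reaches the regex engine as the two characters \s, whereas \r\n is turned into a real carriage return and line feed by the string literal itself, and the engine then matches those characters literally. Both spellings end up working, which may be why the mix went unnoticed.

```python
import re

escaped = "\\s+"   # engine receives: backslash, 's', '+'  (the \s class)
literal = "\r\n"   # engine receives: actual CR and LF characters
raw = r"\r\n"      # engine receives: backslash-r, backslash-n, and decodes them

# The string literal decides how many characters the engine sees.
lengths = (len(escaped), len(literal), len(raw))

# All three spellings still match what was intended.
matches_space = re.fullmatch(escaped, "  \t") is not None
matches_literal = re.fullmatch(literal, "\r\n") is not None
matches_raw = re.fullmatch(raw, "\r\n") is not None
```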

EDIT 2:

Use * instead of + wherever the element is optional (spaces, for example), and prefer non-greedy quantifiers in cases like this one:

<td valign=3D\"top\">For:</td>(\\s)*?(=)?(.|[\r\n])*?<td>([a-zA-Z=\\s]*?)<br>
                                   ^^       ------ ^-     ------------^-

Upvotes: 2
