Reputation: 2927
So, I'm still a regex dummy and have only been using regexes for the past 2 days. However, my problem seems odd, to me at least.
The following pattern correctly matches this string for me:
<td valign=3D\"top\">For:</td>(\\s)+(=)?(.|\r\n|\n)+<td>(([a-z]|[A-Z]|=|\\s)+)<br>
Original String (taken from the html document which is being fed to the regex as input):
<td valign=3D"top">For:</td> = <td>XXXXXX XXXXX<br>
and the matched string:
<td valign=3D"top">For:</td> = <td>XXXXXX XXXXX<br>
However for this string:
<td valign=3D"top">For:</td> <td>YYYYYYY= YYYYY<br>
it matched the entire HTML document. I don't understand why this is happening, since after my (([a-z]|[A-Z]|=|\\s)+ I specified that there should be a <br> tag.
Upvotes: 0
Views: 446
Reputation: 1235
Parsing HTML with regexes is a very bad idea.
See why here: RegEx match open tags except XHTML self-contained tags
Even for parsing very simple things in HTML, using a DOM parser is generally cleaner (more readable) and less error-prone, even more so if you are new to regexes.
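To make the parser suggestion concrete, here is a minimal sketch using Python's standard-library html.parser (an event-based parser rather than a full DOM, and Python is my choice since the question doesn't name a language). The class name ForCellParser and the helper extract_for_value are hypothetical, and the sketch assumes the wanted value is the text of the first cell after the "For:" cell, up to the first <br>.

```python
from html.parser import HTMLParser


class ForCellParser(HTMLParser):
    """Collect the text that follows the 'For:' label cell, up to the first <br>."""

    def __init__(self):
        super().__init__()
        self.in_td = False        # currently inside a <td> element
        self.label_seen = False   # the 'For:' cell has already been read
        self.capturing = False    # inside the cell that follows the label
        self.done = False         # value fully collected, ignore the rest
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "td":
            self.in_td = True
            if self.label_seen:
                self.capturing = True
        elif tag == "br" and self.capturing:
            # The value ends at the first <br> inside the cell.
            self.capturing = False
            self.done = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
            if self.capturing:
                self.capturing = False
                self.done = True

    def handle_data(self, data):
        if self.done:
            return
        if self.capturing:
            self.parts.append(data)
        elif self.in_td and data.strip() == "For:":
            self.label_seen = True

    @property
    def value(self):
        # Collapse runs of whitespace (including line breaks) to single spaces.
        return " ".join("".join(self.parts).split())


def extract_for_value(html):
    parser = ForCellParser()
    parser.feed(html)
    parser.close()
    return parser.value
```

Calling extract_for_value on a fragment such as '<td valign="top">For:</td>\n<td>XXXXXX XXXXX<br>...</td>' should give back the cell text without any regex involved.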
Upvotes: 1
Reputation: 30293
Add the indicated question marks for non-greedy matching:
<td valign=3D\"top\">For:</td>(\\s)+(=)?(.|\r\n|\n)+?<td>(([a-z]|[A-Z]|=|\\s)+?)<br>
                                                    ^                         ^
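As a rough illustration of the greedy/non-greedy difference, here is a small Python sketch (Python's re module, not necessarily the engine used in the question). The sample document, including the "Later Cell" row, is made up for the demonstration.

```python
import re

# Made-up document: the wanted cell, some unrelated markup, then another cell.
doc = (
    '<td valign=3D"top">For:</td> = <td>XXXXXX XXXXX<br>\n'
    '<p>other</p>\n'
    '<td>Later Cell<br>\n'
)

# Pattern from the question: every quantifier is greedy.
greedy = re.compile(
    r'<td valign=3D"top">For:</td>(\s)+(=)?(.|\r\n|\n)+<td>(([a-z]|[A-Z]|=|\s)+)<br>'
)

# Same pattern with the two added question marks.
lazy = re.compile(
    r'<td valign=3D"top">For:</td>(\s)+(=)?(.|\r\n|\n)+?<td>(([a-z]|[A-Z]|=|\s)+?)<br>'
)

greedy_match = greedy.search(doc)
lazy_match = lazy.search(doc)

# Group 4 is the cell text in both patterns.
greedy_cell = greedy_match.group(4)   # taken from the LAST <td>...<br> in doc
lazy_cell = lazy_match.group(4)       # taken from the FIRST <td>...<br> after "For:"
```

The greedy (.|\r\n|\n)+ first swallows everything up to the end of the input and only then backtracks to the last place where <td>...<br> fits, which is why the match can span the whole document. The lazy version grows one character at a time and stops at the first fit.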
EDIT:
Further, you can simplify into a character class instead of using alternation:
<td valign=3D\"top\">For:</td>(\\s)+(=)?(.|[\r\n])+?<td>([a-zA-Z=\\s]+?)<br>
                                           ^^^^^^        ^^^^^^^^^^^^
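A quick way to convince yourself that the character class accepts the same strings as the alternation is to run both over a few samples. This is only a sketch in Python; the sample strings are arbitrary, and the non-capturing group (?:...) is my substitution for the original capturing group.

```python
import re

alternation = re.compile(r'(?:[a-z]|[A-Z]|=|\s)+')
char_class = re.compile(r'[a-zA-Z=\s]+')

samples = ["XXXXXX XXXXX", "YYYYYYY= YYYYY", "abc123", "", "a<b"]


def accepts(pattern, text):
    """True when the whole string consists only of characters the pattern allows."""
    return pattern.fullmatch(text) is not None


# One (sample, alternation result, character-class result) tuple per string.
results = [(s, accepts(alternation, s), accepts(char_class, s)) for s in samples]
```

Both forms agree on every sample; the class is simply shorter to read and gives the engine fewer alternatives to try.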
My only question is why your \\s is escaped while your \r\n are not...
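One plausible explanation, assuming the pattern sits in an ordinary (non-raw) string literal: the doubled backslash makes sure the regex engine receives \s, while \r and \n are already turned into real CR and LF characters by the string literal, and the regex then matches those characters literally. Both spellings work. The Python sketch below shows the same effect; the variable names are just for illustration.

```python
import re

# Non-raw string literals, mirroring how the pattern appears in the question.
whitespace_pattern = "\\s"   # holds two characters: a backslash and 's'
newline_literal = "\r\n"     # holds the two real characters CR and LF
newline_escaped = "\\r\\n"   # holds four characters: \ r \ n

text = "\r\n"

# The regex engine sees real CR/LF characters and matches them literally.
literal_matches = re.fullmatch(newline_literal, text) is not None

# The regex engine sees the escapes \r\n and interprets them itself.
escaped_matches = re.fullmatch(newline_escaped, text) is not None

# \s reaches the engine intact and matches both CR and LF.
whitespace_matches = re.fullmatch(whitespace_pattern + "+", text) is not None
```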
EDIT 2:
Use * instead of + where, for example, spaces aren't mandatory; and non-greedy quantifiers are probably always helpful in these cases:
<td valign=3D\"top\">For:</td>(\\s)*?(=)?(.|[\r\n])*?<td>([a-zA-Z=\\s]*?)<br>
                                   ^^       ------ ^-     ------------^-
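To see why * matters here, consider the second string as it is shown in the question, with a single space between the two cells. With + quantifiers, (\\s)+ takes that space and (.|\r\n|\n)+? still needs at least one more character, so it eats the < of the wanted <td> and has to run on to a later <td>. The Python sketch below shows this; the trailing "Later Cell" row is made up, and the real document's whitespace may differ.

```python
import re

# Second string from the question, followed by made-up trailing markup.
doc = (
    '<td valign=3D"top">For:</td> <td>YYYYYYY= YYYYY<br>\n'
    '<p>other</p>\n'
    '<td>Later Cell<br>\n'
)

# Non-greedy pattern that still uses + (first revision of the answer).
lazy_plus = re.compile(
    r'<td valign=3D"top">For:</td>(\s)+(=)?(.|\r\n|\n)+?<td>(([a-z]|[A-Z]|=|\s)+?)<br>'
)

# Final pattern: * instead of +, everything non-greedy.
lazy_star = re.compile(
    r'<td valign=3D"top">For:</td>(\s)*?(=)?(.|[\r\n])*?<td>([a-zA-Z=\s]*?)<br>'
)

# Group 4 holds the cell text in both patterns.
plus_cell = lazy_plus.search(doc).group(4)   # overshoots to the later cell
star_cell = lazy_star.search(doc).group(4)   # stops at the wanted cell

# The final pattern also handles the first string from the question.
first = '<td valign=3D"top">For:</td> = <td>XXXXXX XXXXX<br>'
first_cell = lazy_star.search(first).group(4)
```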
Upvotes: 2