Reputation: 139
I need to use a regular expression to parse an HTML string where the KEY is located after the VALUE I need to extract.
Sample original string:
<TR><TD>VAL1</TD><TD>KEY1</TD></TR><TR><TD>VAL2</TD><TD>KEY2</TD></TR>
When I try to extract VAL2 with:
<TD>(.*?)</TD><TD>KEY2</TD>
I actually get
VAL1KEY1VAL2
How can I resolve this problem, assuming the Keys are constant and the values are changing?
Thanks in advance, Michael
Upvotes: 0
Views: 1152
Reputation: 46876
I don't know what language you're using, but if it's PHP, I think you'd be better off using DOM rather than parsing this using a regular expression.
Here's one way to do it:
<?php
$html="<TR><TD>VAL1</TD><TD>KEY1</TD></TR><TR><TD>VAL2</TD><TD>KEY2</TD></TR>";
$doc = new DOMDocument();
$doc->loadHTML($html);
// Fetch the cells once; they arrive in document order: value, key, value, key, ...
$cells = $doc->getElementsByTagName('td');
$output=array();
$n=0;
// Each row holds the value first and its key second.
while ($val = $cells->item($n++)) {
    $key = $cells->item($n++);
    if ($key === null) break; // odd number of cells: this value has no key
    $output[$key->textContent]=$val->textContent;
}
print_r($output);
And here's what it shows when I run it.
Array
(
[KEY1] => VAL1
[KEY2] => VAL2
)
Upvotes: 1
Reputation: 354794
Use
<TD>([^<]*)</TD><TD>KEY2</TD>
instead. A lazy quantifier does give the shortest match, but only from the first position at which a match can start, which in this case is the first <TD>. The pattern above sidesteps the problem by restricting the characters that can appear in a value, so a match can never span multiple tags.
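To see the difference concretely, here is a minimal sketch. Python is used only for illustration, since the question doesn't name a language, and `value_for` is a hypothetical helper I made up for the example, not part of any library. It assumes the values never contain a `<` character.

```python
import re

html = "<TR><TD>VAL1</TD><TD>KEY1</TD></TR><TR><TD>VAL2</TD><TD>KEY2</TD></TR>"

# Lazy quantifier: the match still begins at the first <TD>, so the capture
# keeps growing across tags until "</TD><TD>KEY2</TD>" finally follows it.
lazy = re.search(r"<TD>(.*?)</TD><TD>KEY2</TD>", html)

# Negated character class: the capture cannot contain "<", so it is
# confined to the text of a single cell.
strict = re.search(r"<TD>([^<]*)</TD><TD>KEY2</TD>", html)

print(lazy.group(1))    # VAL1</TD><TD>KEY1</TD></TR><TR><TD>VAL2
print(strict.group(1))  # VAL2


def value_for(key, text):
    """Hypothetical helper: return the cell text that precedes `key`, or None."""
    pattern = r"<TD>([^<]*)</TD><TD>" + re.escape(key) + r"</TD>"
    m = re.search(pattern, text)
    return m.group(1) if m else None


print(value_for("KEY1", html))  # VAL1
```

Since the keys are constant, escaping the key and building the pattern around it should be enough. If a value could itself contain markup, the negated class would no longer match and a real HTML parser would be the safer route.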
Upvotes: 5