Skatrix

Reputation: 139

Regular Expression - shortest match

I need to use a regular expression to parse an HTML string in which the KEY is located after the VALUE I need to extract.

Sample original string:

<TR><TD>VAL1</TD><TD>KEY1</TD></TR><TR><TD>VAL2</TD><TD>KEY2</TD></TR>

When I try to extract VAL2 with:

<TD>(.*?)</TD><TD>KEY2</TD>

I actually get

VAL1KEY1VAL2

How can I resolve this problem, assuming the Keys are constant and the values are changing?

Thanks in advance, Michael

Upvotes: 0

Views: 1152

Answers (2)

ghoti

Reputation: 46876

I don't know what language you're using, but if it's PHP, I think you'd be better off using DOM rather than parsing this using a regular expression.

Here's one way to do it:

<?php

$html = "<TR><TD>VAL1</TD><TD>KEY1</TD></TR><TR><TD>VAL2</TD><TD>KEY2</TD></TR>";

// Parse the HTML fragment into a DOM tree.
$doc = new DOMDocument();
$doc->loadHTML($html);

// Fetch the cells once. They alternate: value, key, value, key, ...
$cells = $doc->getElementsByTagName('td');

$output = array();
for ($n = 0; $n + 1 < $cells->length; $n += 2) {
  $val = $cells->item($n);
  $key = $cells->item($n + 1);
  // Index the result by key so a value can be looked up directly.
  $output[$key->textContent] = $val->textContent;
}

print_r($output);

And here's what it shows when I run it.

Array
(
    [KEY1] => VAL1
    [KEY2] => VAL2
)

Upvotes: 1

Joey

Reputation: 354794

Use

<TD>([^<]*)</TD><TD>KEY2</TD>

instead. A lazy quantifier does give you the shortest match, but only from the earliest position at which a match can start, which in this case is the first <TD>. The pattern above sidesteps the problem by restricting the characters that can appear in a value to anything except <, so the captured group can never span multiple tags.
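To see the difference concretely, here is a minimal sketch. The question doesn't say which language is in use, so Python's re module is only an assumption for illustration, and value_for_key is a hypothetical helper name, not something from the original post. It runs both patterns against the sample string and shows what each one captures.

```python
import re

html = "<TR><TD>VAL1</TD><TD>KEY1</TD></TR><TR><TD>VAL2</TD><TD>KEY2</TD></TR>"

# Lazy quantifier: the engine still starts at the first <TD>, so the group
# keeps growing until "</TD><TD>KEY2</TD>" finally follows it.
lazy = re.search(r"<TD>(.*?)</TD><TD>KEY2</TD>", html)

# Negated character class: the group may not contain "<", so it cannot
# cross a tag boundary and the match has to start at the right <TD>.
strict = re.search(r"<TD>([^<]*)</TD><TD>KEY2</TD>", html)

print(lazy.group(1))    # VAL1</TD><TD>KEY1</TD></TR><TR><TD>VAL2
print(strict.group(1))  # VAL2


def value_for_key(html, key):
    """Hypothetical helper: return the value in the cell just before `key`."""
    pattern = r"<TD>([^<]*)</TD><TD>" + re.escape(key) + r"</TD>"
    match = re.search(pattern, html)
    return match.group(1) if match else None
```

With the sample string, value_for_key(html, "KEY1") gives VAL1, and a key that isn't present gives None. This only holds as long as the values themselves never contain a < character.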

Upvotes: 5
