Paul B

Reputation: 349

Regex Is Always Greedy

So I have the following HTML:

<td class="testing">
    <strong><span><a href="whatever">test</a></span></strong>
    <div class="body" id="id_1234">test</div>
</td>
<td class="testing">
    <strong><span><a href="whatever2">test</a></span></strong>
    <div class="body" id="id_5678">test</div>
</td>
<td class="testing2">
    <strong><span><a href="whatever2">test2</a></span></strong>
    <div class="body" id="id_9012">test</div>
</td>

And I have the following regex that tries to get both 1234 and 5678:

~class="testing">\s*?<strong>.*?<a href=".*?">test</a>.*?<div class="body" id="id_(.*)">~Us

However, this returns only 5678, and not both:

[1] => Array
    (
        [0] => 5678
    )

How could I make it use the shortest overall match? I already use the ? modifier after every .*, as well as the U modifier at the end.

Thanks!

Upvotes: 0

Views: 80

Answers (4)

Phil

Reputation: 164798

Using DOM and XPath

$html = <<<_HTML
<td class="testing">
    <strong><span><a href="whatever">test</a></span></strong>
    <div class="body" id="id_1234">test</div>
</td>
<td class="testing">
    <strong><span><a href="whatever2">test</a></span></strong>
    <div class="body" id="id_5678">test</div>
</td>
<td class="testing2">
    <strong><span><a href="whatever2">test2</a></span></strong>
    <div class="body" id="id_9012">test</div>
</td>
_HTML;

$doc = new DOMDocument;
$doc->loadHTML($html);
$xp = new DOMXpath($doc);
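// ".//a" keeps the link test relative to each <td>; a bare "//a" would search the whole document.
$divs = $xp->query('//td[@class="testing" and .//a[normalize-space(text())="test"]]/div[@class="body" and starts-with(@id, "id_")]');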

$ids = array();
foreach ($divs as $div) {
    $ids[] = substr($div->getAttribute('id'), 3);
}

Example here - http://codepad.viper-7.com/GbKIj2

Upvotes: 2

Casimir et Hippolyte

Reputation: 89557

Your pattern doesn't work because of a misunderstanding of the U modifier.

The U modifier doesn't make every quantifier ungreedy (or lazy). It is a switch that inverts greediness, so when you use it:

1) all the greedy quantifiers become ungreedy (or lazy)

2) all the ungreedy (or lazy) quantifiers become greedy.
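
To make the swap concrete, here is a minimal sketch. The sample string and variable names are made up for illustration and are not from the question; it only shows how the same two quantifiers behave with and without U.

```php
<?php
// Hypothetical sample string, chosen so greedy and lazy matches differ.
$subject = '<b>one</b><b>two</b>';

// Without U: .* is greedy, .*? is lazy.
preg_match('~<b>(.*)</b>~', $subject, $greedyDefault);
preg_match('~<b>(.*?)</b>~', $subject, $lazyDefault);

// With U: the roles are swapped, so .* is lazy and .*? is greedy.
preg_match('~<b>(.*)</b>~U', $subject, $plainWithU);
preg_match('~<b>(.*?)</b>~U', $subject, $questionMarkWithU);

echo $greedyDefault[1], PHP_EOL;     // one</b><b>two
echo $lazyDefault[1], PHP_EOL;       // one
echo $plainWithU[1], PHP_EOL;        // one
echo $questionMarkWithU[1], PHP_EOL; // one</b><b>two
```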

Since your pattern uses the U modifier, every .*? in it is greedy, while the plain (.*) in the capture group is the only lazy quantifier.
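
One possible fix, as a sketch rather than the definitive pattern: drop the U modifier so each .*? stays lazy, and tighten the two open-ended parts. The `[^"]*` for the href and the `\d+` for the id are my assumptions (that the href contains no double quote and that the ids are purely numeric); the HTML is the sample from the question.

```php
<?php
$html = <<<'HTML'
<td class="testing">
    <strong><span><a href="whatever">test</a></span></strong>
    <div class="body" id="id_1234">test</div>
</td>
<td class="testing">
    <strong><span><a href="whatever2">test</a></span></strong>
    <div class="body" id="id_5678">test</div>
</td>
<td class="testing2">
    <strong><span><a href="whatever2">test2</a></span></strong>
    <div class="body" id="id_9012">test</div>
</td>
HTML;

// No U modifier: each .*? remains lazy. The s modifier lets . match newlines.
// Assumption: ids are digits only, so (\d+) replaces the original (.*).
$pattern = '~class="testing">\s*<strong>.*?<a href="[^"]*">test</a>.*?<div class="body" id="id_(\d+)">~s';

preg_match_all($pattern, $html, $matches);

print_r($matches[1]); // Array ( [0] => 1234 [1] => 5678 )
```

Note that a lazy .*? can still run past a closing </td> when a "testing" cell has no matching link, so for markup that varies, a DOM-based approach is safer.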

Upvotes: 2

Orangepill

Reputation: 24645

This produces the results you are after:

<?php

$str = '<td class="testing">
    <strong><span><a href="whatever">test</a></span></strong>
    <div class="body" id="id_1234">test</div>
</td>
<td class="testing">
    <strong><span><a href="whatever2">test2</a></span></strong>
    <div class="body" id="id_5678">test</div>
</td>';

$matches = array();

// The = and " characters need no escaping here, and the m modifier has no effect without ^ or $.
preg_match_all('/id="id_([0-9]+)"/', $str, $matches);

print_r($matches[1]);

Upvotes: 0

DevZer0

Reputation: 13535

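You can use preg_match_all. It already returns every match, so PHP has no g modifier (adding one triggers an "Unknown modifier" warning):

preg_match_all('/id="id_([0-9]+)"/', $html, $matches);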

Upvotes: 0
