Phani Shashank

Reputation: 98

Regular expression returns an empty array in PHP even though the regular expression is correct

This is my regular expression:

$pattern_new="/<td>(\n|\s)*?(<span(\n|\s|.)*?<\/strong>(\n|\s)*?\$(?<price>([0-9.]*)).*?)\$(.*?)(\n|\s)*?</";

This is the sample text I have to match against:

<td><strong>.zx</strong></td><td><span class="offer"><strong>xscre:<br></strong>$299 xxxxx&x;xx<span class="fineprint_number">2</span></span><br>de&ea;s $399</td><td>zxcddcdcdcdc</td></tr><tr class="dark"><td><strong>.aa.rr</strong></td><td><span class="offer"><strong>xscre:<br></strong>$99 xxxxx&x;xx<span class="fineprint_number">2</span></span><br>de&eae;s $199</td><td>xxxx</td></tr><tr class="bar"><td colspan="3"></td></tr><tr class="bright"><td><strong>.vfd</strong></td><td><span class="offer"><strong>xscre:<br></strong>$99 xxxxx&x;xx<span class="fineprint_number">2</span></span><br>du&ee;s $199</td><td>xxxxxxxx</td></tr><tr class="dark"><td><strong>.qwe</strong></td><td><span class="offer"><strong>xxx<br></strong>$99 xxxc;o<span class="fineprint_number">2</span>

Here is what I am doing in PHP:

$pattern_new="/<td>(\n|\s)*?(<span(\n|\s|.)*?<\/strong>(\n|\s)*?\$(<price>)*([0-9.]*).*?)\$(.*?)(\n|\s)*?</";
$source = file_get_contents("https://www.abc.com/sources/data.txt");
preg_match_all($pattern_new, $source, $match_newprice, PREG_PATTERN_ORDER);
echo $source;
print_r($match_newprice);

The $match_newprice array comes back empty.

When I use a regex tester such as myregextester or solmetra.com, I get a perfect match with no issues, but PHP's preg_match_all returns an empty array for the same pattern. I increased pcre.backtrack_limit, but the result is the same. I don't understand what the problem is. Any help would be much appreciated.

Upvotes: 1

Views: 297

Answers (3)

Casimir et Hippolyte

Reputation: 89557

The good way to do that:

$oProductsHTML = new DOMDocument();
@$oProductsHTML->loadHTML($sHtml);

$oSpanNodes = $oProductsHTML->getElementsByTagName('span');

foreach ($oSpanNodes as $oSpanNode) {
    if (preg_match('~\boffer\b~', $oSpanNode->getAttribute('class')) &&
        preg_match('~\$\K\d++~', $oSpanNode->nodeValue, $aMatch) )
    {
        $sPrice = $aMatch[0];
        echo '<br/>' . $sPrice;
    }
}

$sHtml stands for your string.

I'm sure you can make it shorter with XPath.
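As a rough illustration of the XPath idea, here is a minimal sketch. The sample markup in $sHtml is a shortened, hypothetical version of the question's data, and the query string and the $oXPath, $sQuery and $aPrices names are assumptions made for this example, not part of the original answer.

```php
<?php
// Hypothetical sample input, trimmed from the markup in the question.
$sHtml = '<table><tr><td><strong>.zx</strong></td>'
       . '<td><span class="offer"><strong>xscre:<br></strong>$299 xxxxx'
       . '<span class="fineprint_number">2</span></span><br>des $399</td></tr></table>';

$oDoc = new DOMDocument();
// Suppress warnings about imperfect markup, as in the snippet above.
@$oDoc->loadHTML($sHtml);

$oXPath = new DOMXPath($oDoc);
// Select every <span> whose class list contains the token "offer".
$sQuery = '//span[contains(concat(" ", normalize-space(@class), " "), " offer ")]';

$aPrices = [];
foreach ($oXPath->query($sQuery) as $oSpanNode) {
    // \K drops the "$" from the reported match, leaving only the digits.
    if (preg_match('~\$\K\d++~', $oSpanNode->textContent, $aMatch)) {
        $aPrices[] = $aMatch[0];
    }
}

print_r($aPrices);
```

The query replaces the class check that the loop above does with preg_match; the price itself is still pulled out of the node text with the same small regex.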

The bad way:

$sPattern = '~<span class="offer\b(?>[^>]++|>(?!\$))+>\$\K\d++~';
preg_match_all($sPattern, $sHtml, $aMatches);

print_r ($aMatches[0]);

Note: \d++ can be replaced by \d++(?>\.\d++)? to allow decimal numbers.
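A quick way to check the decimal variant is to run it over input that has both kinds of price. The two spans in $sHtml below are made-up test data, not taken from the question.

```php
<?php
// Hypothetical input with one whole-number price and one decimal price.
$sHtml = '<span class="offer"><strong>a:<br></strong>$299 x</span>'
       . '<span class="offer"><strong>b:<br></strong>$99.50 y</span>';

// Same pattern as above, with the optional decimal part appended.
$sPattern = '~<span class="offer\b(?>[^>]++|>(?!\$))+>\$\K\d++(?>\.\d++)?~';
preg_match_all($sPattern, $sHtml, $aMatches);

// $aMatches[0] holds the full matches, i.e. the prices without the "$".
print_r($aMatches[0]);
```

With this input, $aMatches[0] should contain 299 and 99.50.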

Upvotes: 1

Wrikken

Reputation: 70490

Another problem here, this one specific to PHP:

<?php
echo "\$".PHP_EOL;
echo '\$'.PHP_EOL;

Result:

$
\$

In double-quoted strings, $ is expected to signal the start of a variable and needs escaping if you mean a bare $. PHP therefore consumes the backslash, and the regex engine receives an unescaped $, which it treats as an end-of-line anchor rather than a literal dollar sign. Put single quotes around your regex and it will probably be fine. I haven't looked at it in detail, though; you may want to use the /x option and add some formatting whitespace and comments in case you need to debug this half a year from now.
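To make the quoting difference concrete, here is a small sketch that compares the two string styles on a reduced pattern. The $source sample and the shortened pattern are assumptions for illustration, not the asker's full regex. The ~ delimiter is swapped in for / so that the closing tag needs no escaping, and because a / inside an /x comment would end the pattern early.

```php
<?php
// Hypothetical one-cell sample, shortened from the question's markup.
$source = '<td><span class="offer"><strong>xscre:<br></strong>$299 xx</span><br>des $399</td>';

// Double quotes: PHP turns \$ into a bare $, which PCRE reads as an end anchor.
$double = "~</strong>\s*\$(?<price>[0-9.]+)~";

// Single quotes: \$ reaches PCRE unchanged. The x modifier lets the pattern
// carry layout whitespace and # comments.
$single = '~
    </strong> \s*          # closing tag, then optional whitespace
    \$ (?<price> [0-9.]+ ) # literal dollar sign, then the price
~x';

$hitsDouble = preg_match($double, $source, $m1);
$hitsSingle = preg_match($single, $source, $m2);

echo "double-quoted pattern matched: $hitsDouble", PHP_EOL;
echo "single-quoted pattern matched: $hitsSingle", PHP_EOL;
echo 'price: ', $m2['price'] ?? '(none)', PHP_EOL;
```

The double-quoted pattern cannot match here, because it requires digits after the end of the subject, while the single-quoted one captures 299 in the price group.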

Upvotes: 1

Smern

Reputation: 19076

I assume you were trying to make a non-capturing group for <price..., but you left out the :. Otherwise, you should take out the question mark. If the price group is optional, try the regex below. A regex visualizer such as Debuggex can help you with this; I find it extremely helpful.

<td>(\n|\s)*?(<span(\n|\s|.)*?<\/strong>(\n|\s)*?\$(<price>)*([0-9.]*).*?)\$(.*?)(\n|\s)*?<

Edit live on Debuggex

In the above example, your first match would have the following captures:

0: "<td><span class="offer"><strong>xscre:<br></strong>$299 xxxxx&x;xx<span class="fineprint_number">2</span></span><br>de&ea;s $399<"
1: ""
2: "<span class="offer"><strong>xscre:<br></strong>$299 xxxxx&x;xx<span class="fineprint_number">2</span></span><br>de&ea;s "
3: ">"
4: ""
5: ""
6: "299"
7: "399"
8: ""

Is this what you are looking for?

Upvotes: 2
