Difficulties with the function preg_match_all

Question

I would like to retrieve the number that sits between the span HTML tags. The number may change!


<span class="topic-count">
  ::before
  "
             24
          "
  ::after
</span>

I've tried the following code:

preg_match_all("#<span class=\"topic-count\">(.*?)</span>#", $source, $nombre[$i]);

But it doesn't work.

Entire code:

$result=array();
$page = 201;
while ($page>=1) {
    $source = file_get_contents ("http://www.jeuxvideo.com/forums/0-27047-0-1-0-".$page."-0-counter-strike-global-offensive.htm");
    preg_match_all("#<span class=\"topic-count\">(.*?)</span>#", $source, $nombre[$i]);
    $result = array_merge($result, $nombre[$i][1]);
    print("Page : ".$page ."\n");
    $page-=25;
}
print_r ($nombre);

Gordon · Accepted Answer

You can do this with

preg_match_all(
    '#<span class="topic-count">[^\d]*(\d+)[^\d]*?</span>#s',
    $html, 
    $matches
);

which captures the run of digits inside the span, skipping the non-digit characters around it.

However, note that this regex only works for exactly this piece of HTML. If the markup varies even slightly, for instance with another class or another attribute, the pattern stops matching. Writing reliable regexes for HTML is hard.
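
To illustrate, here is a minimal sketch, assuming the pattern is anchored on a span whose class attribute is exactly topic-count. Both sample strings are invented for the demonstration and are not copied from the real page.

```php
<?php
// Pattern anchored on one exact opening tag (the class value is an assumption).
$pattern = '#<span class="topic-count">[^\d]*(\d+)[^\d]*?</span>#s';

// Hypothetical markup: first the exact shape, then a slight variation.
$exact   = "<span class=\"topic-count\">\n             24\n          </span>";
$variant = "<span class=\"topic-count new\" title=\"replies\">\n   24\n</span>";

$exactHits   = preg_match_all($pattern, $exact, $m1);
$variantHits = preg_match_all($pattern, $variant, $m2);

// The extra class and attribute are enough to make the second string miss.
echo $exactHits, ' match(es), captured: ', $m1[1][0] ?? 'none', PHP_EOL;
echo $variantHits, ' match(es), captured: ', $m2[1][0] ?? 'none', PHP_EOL;
```

The first line reports one match with 24 captured; the second reports no match at all, even though a person reading the markup would see the same number.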

Hence the recommendation to use a DOM parser instead, e.g.

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.jeuxvideo.com/forums/0-27047-0-1-0-1-0-counter-strike-global-offensive.htm');
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate('//span[contains(@class, "topic-count")]') as $node) {
    if (preg_match('#\d+#', $node->nodeValue, $topics)) {
        echo $topics[0], PHP_EOL;
    }
}

DOM will parse the entire page into a tree of nodes, which you can then query conveniently via XPath. Note the expression

//span[contains(@class, "topic-count")]

which will give you all the span elements with a class attribute containing the string topic-count. The loop then echoes the first number found in each of these nodes, if there is one.
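
One thing worth knowing about contains(): it is a plain substring test, so it would also match a class such as topic-counter. The sketch below runs the same approach offline on made-up markup and compares it with a stricter whole-token test. The sample HTML, the helper name extractCounts and the variable names are hypothetical and only there for illustration.

```php
<?php
// Hypothetical markup standing in for the forum page.
$html = <<<HTML
<ul>
  <li><span class="topic-count">
             24
          </span></li>
  <li><span class="topic-count hot">  7 </span></li>
  <li><span class="topic-counter">99</span></li>
</ul>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);

// Substring test, as in the answer: also matches class="topic-counter".
$loose = '//span[contains(@class, "topic-count")]';
// Whole-token test: pad the class list with spaces so only the exact class name matches.
$strict = '//span[contains(concat(" ", normalize-space(@class), " "), " topic-count ")]';

// Hypothetical helper: returns the first integer found in each matching node.
function extractCounts(DOMXPath $xpath, string $query): array
{
    $counts = [];
    foreach ($xpath->query($query) as $node) {
        if (preg_match('#\d+#', $node->nodeValue, $m)) {
            $counts[] = (int) $m[0];
        }
    }
    return $counts;
}

print_r(extractCounts($xpath, $loose));   // 24, 7 and 99
print_r(extractCounts($xpath, $strict));  // 24 and 7 only
```

Which of the two queries fits depends on the class names the page actually uses; if nothing else on the page starts with topic-count, the simpler contains() form is probably enough.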
