Henry

Reputation: 37

Unable to use regex to search in PHP?

I'm trying to extract the contents of specific tags from an HTML document.

My method works for some tags but not all, and it does not work for the tag whose content I actually want.

Here is my code:

<html>
<head></head>
<body>
<?php 
     $url = "http://sf.backpage.com/MusicInstruction/";   
     $data = file_get_contents($url);
     $pattern = "/<div class=\"cat\">(.*)<\/div>/";
     preg_match_all($pattern, $data, $adsLinks, PREG_SET_ORDER);
     var_dump($adsLinks);
     foreach ($adsLinks as $i) {
         echo "<div class='ads'>".$i[0]."</div>";
     } 

?>
</body>
</html>

The above code returns no matches, but it works when I change $pattern to:

$pattern = "/<div class=\"date\">(.*)<\/div>/";

or

$pattern = "/<div class=\"sponsorBoxPlusImages\">(.*)<\/div>/";

I can't see any difference between these patterns. Please help me find the error. Thanks.

Upvotes: 0

Views: 90

Answers (2)

shamittomar

Reputation: 46692

Use PHP DOM to parse HTML instead of regex.

For example in your case (code updated to show HTML):

$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents("http://sf.backpage.com/MusicInstruction/"));
$nodes = $doc->getElementsByTagName('div');

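for ($i = 0; $i < $nodes->length; $i++)
{
    $x = $nodes->item($i);

    // No semicolon after the condition: a stray ";" would end the if
    // statement and make the echo run for every div.
    if ($x->getAttribute('class') == 'cat') {
        echo htmlspecialchars($x->nodeValue) . "<hr/>"; // this is the element that you want
    }
}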
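As a variation on the loop above, here is a minimal sketch that selects the divs with DOMXPath instead of testing every div by hand. The inline $html string is hypothetical sample markup standing in for the fetched page, and the XPath expression assumes you also want divs that carry extra classes next to "cat".

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = <<<HTML
<html><body>
<div class="cat">
  <a href="/one">First ad</a>
</div>
<div class="date">Jan 1</div>
<div class="cat featured">
  <a href="/two">Second ad</a>
</div>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Match divs whose class attribute contains the whole token "cat".
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " cat ")]';

$texts = [];
foreach ($xpath->query($query) as $div) {
    $texts[] = trim($div->textContent);
}

foreach ($texts as $text) {
    echo htmlspecialchars($text), "\n";
}
```

Using libxml_use_internal_errors() rather than the @ operator keeps warnings from malformed real-world HTML out of the output while still letting you inspect them if needed.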

Upvotes: 4

Paul Dixon

Reputation: 300825

Your regex fails because you are expecting . to match newlines, and it won't unless you use the s modifier. The "cat" div presumably spans several lines, while the divs that do match sit on a single line. So try:

$pattern = "/<div class=\"cat\">(.*)<\/div>/s";

When you do this, you might find the pattern a little too greedy, as it will try to capture everything up to the last closing div element. To make it non-greedy, so that it matches only up to the very next closing div, add a ? after the *:

$pattern = "/<div class=\"cat\">(.*?)<\/div>/s";
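To see the three patterns side by side, here is a small sketch run against a hypothetical two-div sample string (not the real page), in which each "cat" div spans several lines:

```php
<?php
// Hypothetical sample markup; each div is spread over several lines.
$data = "<div class=\"cat\">\n  <a href=\"/one\">First ad</a>\n</div>\n"
      . "<div class=\"cat\">\n  <a href=\"/two\">Second ad</a>\n</div>\n";

// Without the s modifier, "." stops at the first newline, so nothing matches.
$noModifier = preg_match_all('/<div class="cat">(.*)<\/div>/', $data, $m1);

// With s but a greedy quantifier, one match runs to the last closing div.
$greedy = preg_match_all('/<div class="cat">(.*)<\/div>/s', $data, $m2);

// With s and a lazy quantifier, each div is matched separately.
$lazy = preg_match_all('/<div class="cat">(.*?)<\/div>/s', $data, $m3);

echo "no modifier: $noModifier\n"; // no modifier: 0
echo "greedy:      $greedy\n";     // greedy:      1
echo "lazy:        $lazy\n";       // lazy:        2
```

Even the lazy version will break as soon as a "cat" div contains a nested div, because the match ends at the first closing tag it meets.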

This just serves to illustrate that for all but the simplest cases, parsing HTML with regexes is the road to madness. So try using DOM functions for parsing HTML.

Upvotes: 2
