How do I remove certain tags and contents with PHP, using PHPCrawler

Question

I am currently using PHPCrawler for some search functionality on a site. I need to exclude some of the page elements from being indexed.

For example, I have used:

$doc_body = preg_replace('/<li>(.*?)<\/li>/is', "", $doc_body);

to remove lists, because I don't want the lists in the results. This works exactly as it should.

Now, another thing I need to remove is the following:

all contents within <div class="example">

so for this I have tried:

   $doc_body = preg_replace('/<div class="example">(.*?)<\/div>/is', "", $doc_body);

Which produces an error because perhaps not every page has the div class example. So I have adapted it with the following code:

      if(strpos($doc_body,'<div class="example">')){
      $doc_body = preg_replace('/<div class="example">(.*?)<\/div>/is', "", $doc_body);
      }

That unfortunately does not work either! It doesn't produce an error, but it doesn't remove <div class="example"> and all of its contents from the results.

This is my first time working with either PHPCrawler or DOMDocument, although I am not sure if my problem here has anything to do with them?

Daniel · Accepted Answer

I'd suggest you take a look at DOMDocument and XPath. XPath is used to query the document model much like CSS selectors do, but with a somewhat different syntax. W3Schools has a lightweight tutorial on XPath here.

Regular expressions are usually a poor choice for processing an entire HTML document: they can be resource heavy and slow on large pages, and they do not cope well with nested or irregular markup.
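To illustrate the nesting problem, here is a minimal sketch. The markup in $html is made up for the example, and the pattern is only assumed to resemble the kind of regex tried in the question. The non-greedy match ends at the first closing div, so part of the block is left behind:

```php
<?php
// Hypothetical markup: an "example" div that contains another div.
$html = '<p>keep</p><div class="example"><div>inner</div> tail</div><p>also keep</p>';

// The non-greedy (.*?) stops at the first </div>, which closes the INNER div,
// not the outer "example" div.
$result = preg_replace('/<div class="example">(.*?)<\/div>/is', '', $html);

// The text " tail" and a stray </div> survive the replacement.
echo $result, PHP_EOL;
```

A DOM-based approach avoids this because it removes the element node together with everything nested inside it.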

For example, to find every div with the class "example" using XPath, you would query the document like this:

//div[@class="example"]

Then remove the nodes with the DOMDocument API and finally normalize the document to get the final result.

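$doc = new DOMDocument();

// Crawled pages are rarely valid HTML; collect parser warnings instead of printing them
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Create the XPath object after loading, so that it queries the loaded document
$xpath = new DOMXPath($doc);

// Remove all the lists
foreach ($xpath->query("//ul | //ol") as $node) {
     $node->parentNode->removeChild($node);
}

// Remove all <div class="example"> nodes
foreach ($xpath->query("//div[@class='example']") as $node) {
     $node->parentNode->removeChild($node);
}

$doc->normalize();

// Get the final document for indexing
$html = $doc->saveHTML();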
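Note that @class='example' only matches when the class attribute is exactly "example". If the div may carry several classes, a common XPath 1.0 idiom is to pad the attribute with spaces and look for the class as a whole token. The following is a sketch under that assumption; removeElementsByClass is a hypothetical helper name (not part of PHPCrawler or DOMDocument), the sample markup is made up, and $class is assumed to be a plain class name without quotes:

```php
<?php
// Hypothetical helper: removes every div that has $class among its classes.
function removeElementsByClass(string $html, string $class): string
{
    $doc = new DOMDocument();

    // Suppress parser warnings for imperfect HTML, then restore the old setting
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Pad the class attribute with spaces so "example" matches only as a whole
    // token: "box example" matches, "examples" does not.
    $query = sprintf(
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );

    foreach ($xpath->query($query) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    return $doc->saveHTML();
}

// Made-up sample page
$html = '<html><body><p>keep</p>'
      . '<div class="box example">drop</div>'
      . '<div class="examples">stay</div></body></html>';

$clean = removeElementsByClass($html, 'example');
echo $clean;
```

In this sample the div with class "box example" is removed, while the div with class "examples" is left in place.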
