Alexandru Pietroiu

Reputation: 23

How can I skip or remove a list of HTML tags from my crawler object using the Symfony DomCrawler Component and Goutte for Laravel 4?

This was my first attempt but it did not work.

$this->crawler = $client->request('GET', $this->url);
$document = new \DOMDocument('1.0', 'UTF-8');
$root = $document->appendChild($document->createElement('_root'));
$this->crawler->rewind();
$root->appendChild($document->importNode($this->crawler->current(), true));

$selectorsToRemove = ['script','p'];
foreach ($selectorsToRemove as $selector) {
   $crawlerInverse = $this->crawler->filter($selector);
   foreach ($crawlerInverse as $elementToRemove) {
      $parent = $elementToRemove->parentNode;
      $parent->removeChild($elementToRemove);
    }
}
$this->crawler->clear();
$this->crawler->add($document);

I want to get the "p" tags from this page http://www.amazon.com/dp/B00IOY8XWQ/ref=fs_kv, and it seems that there is some JS inside the paragraph, so when I call $node->text(); I get both the paragraph text and the JS from the "script" nested inside the "p". The structure is like this:

<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut    labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
<script>
 "JS CODE"
</script>
</p>

So I just want the Lorem ipsum text.

Upvotes: 2

Views: 2480

Answers (1)

user3942918

Reputation: 26385

I took a look at DomCrawler and don't see a whole lot of purpose in it. It seems to just wrap the already-plenty-easy-to-use DOM extension, so I'm going to take a shortcut and use that directly instead.

The example is short and simple; you should be able to adapt it more or less as-is, since you already have a DOMDocument ready to go.


Example:

$html = <<<'HTML'
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut    labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
<script>
 "JS CODE"
</script>
</p>
HTML;

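$dom = new DOMDocument();
// loadXML() works here only because the sample markup is well-formed XML.
$dom->loadXML($html);
$xpath = new DOMXPath($dom);

// Select every <script> that is a direct child of a <p> and detach it from its parent.
foreach ($xpath->query('//p/script') as $node) {
    $node->parentNode->removeChild($node);
}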

echo $dom->saveXML();

Output:

<?xml version="1.0"?>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut    labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

</p>
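
A real page such as the one in the question is unlikely to be well-formed XML, so loadXML() would probably choke on it. Here is a rough sketch of the same idea using loadHTML() instead, and then reading the remaining paragraph text through textContent. The sample markup and the $paragraphs variable are made up for illustration, and I'm assuming the script may sit at any depth inside the paragraph, hence //p//script rather than //p/script.

```php
<?php
// Made-up sample markup standing in for a fetched page (not well-formed XML).
$html = '<html><body><p>Lorem ipsum dolor sit amet.'
      . '<script>var x = "JS CODE";</script></p><br></body></html>';

$dom = new DOMDocument();

// loadHTML() tolerates sloppy markup; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Detach every <script> found anywhere inside a <p>.
foreach ($xpath->query('//p//script') as $script) {
    $script->parentNode->removeChild($script);
}

// Collect the text of each paragraph now that the scripts are gone.
$paragraphs = array();
foreach ($xpath->query('//p') as $p) {
    $paragraphs[] = trim($p->textContent);
}

print_r($paragraphs);
```

If you would rather keep the original document untouched, the same loop could be run on a clone of it; that is a design choice rather than a requirement.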

Upvotes: 2
