psudo

Reputation: 1558

Webscraping with Goutte and Guzzle

I have the following method from my controller that gets the data from the site:

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;

$goutteClient = new Client();
$guzzleClient = new GuzzleClient([
   'timeout' => 60, // seconds
]);
$goutteClient->setClient($guzzleClient);
$crawler = $goutteClient->request('GET', 'https://html.duckduckgo.com/html/?q=Laravel');
// Dump the title text of every search result
$crawler->filter('.result__title .result__a')->each(function ($node) {
    dump($node->text());
});

The code above gives me the title of each search result. I also want the link of the corresponding result, which sits in the element with the class result__extras__url.

How do I extract the link and the title in one pass? Or do I have to run a separate filter for that?

Upvotes: 1

Views: 2651

Answers (2)

JoeGalind

Reputation: 3805

For parsing, I usually do the following:

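libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep parser warnings quiet
$doc = new DOMDocument();
// The Crawler has no getBody() method; html() returns the markup of its first node
$doc->loadHTML($crawler->html());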

From then on, you can reach the elements through the getElementsByTagName method of your DOMDocument.

For example:

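$rows = $doc->getElementsByTagName('tr');
foreach ($rows as $row) {
    $cols = $row->getElementsByTagName('td');
    if ($cols->length === 0) {
        continue; // e.g. a header row that only contains <th> cells
    }
    $value = trim($cols->item(0)->nodeValue); // text of the first cell in the row
}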

You can find more information at https://www.php.net/manual/en/class.domdocument.php
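
Applied to the question, the same DOM approach can read the title and the href of each result link in one loop. The following is only a minimal sketch: the sample markup is hypothetical, modeled on the class names mentioned in the question, and DOMXPath is used instead of getElementsByTagName so that the links can be matched by class.

```php
<?php
// Hypothetical sample markup, modeled on the class names from the question.
$html = <<<'HTML'
<html><body>
<h2 class="result__title"><a class="result__a" href="//duckduckgo.com/l/?uddg=https%3A%2F%2Flaravel.com%2F&amp;rut=abc">Laravel</a></h2>
<h2 class="result__title"><a class="result__a" href="//duckduckgo.com/l/?uddg=https%3A%2F%2Fgithub.com%2Flaravel&amp;rut=def">GitHub</a></h2>
</body></html>
HTML;

libxml_use_internal_errors(true); // tolerate imperfect HTML
$doc = new DOMDocument();
$doc->loadHTML($html);

// Match every <a> whose class list contains "result__a".
$xpath = new DOMXPath($doc);
$anchors = $xpath->query(
    "//a[contains(concat(' ', normalize-space(@class), ' '), ' result__a ')]"
);

// Collect title and href together, one entry per result.
$results = [];
foreach ($anchors as $a) {
    $results[] = [
        'title' => trim($a->textContent),
        'href'  => $a->getAttribute('href'),
    ];
}

print_r($results);
```

In a real run, the string passed to loadHTML would be the markup of the fetched page rather than the sample above.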

Upvotes: 1

Sohel Aman

Reputation: 492

Try to inspect the attributes of the nodes. Once you get the href attribute, parse it to get the URL.

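$crawler->filter('.result__title .result__a')->each(function ($node) {
    // Do not urldecode() first: parse_str() already decodes, and decoding
    // early would split a target URL that has its own query string.
    $parts = parse_url($node->attr('href'));
    parse_str($parts['query'] ?? '', $params);
    $url = $params['uddg'] ?? null; // DDG links to its own redirect URL and carries the actual URL in the uddg query param.
    $title = $node->text();
});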
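
The href parsing can be tried on its own, without a network request. The sketch below is an illustration under assumptions: extractTargetUrl is a hypothetical helper name and the sample href is made up to resemble a DDG redirect link. It passes the raw href to parse_url and lets parse_str do the decoding, so a target URL that carries its own query string stays intact.

```php
<?php
// Hypothetical helper: returns the target URL carried in the "uddg"
// query parameter, or null when the href has no such parameter.
function extractTargetUrl(string $href): ?string
{
    $query = parse_url($href, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null; // no query string, or an unparsable href
    }

    parse_str($query, $params); // parse_str URL-decodes the values

    return isset($params['uddg']) && is_string($params['uddg'])
        ? $params['uddg']
        : null;
}

// Assumed sample href; the encoded target contains its own query string.
$href = '//duckduckgo.com/l/?uddg=https%3A%2F%2Fexample.com%2Fdocs%3Fa%3D1%26b%3D2&rut=abc';

var_dump(extractTargetUrl($href)); // string(32) "https://example.com/docs?a=1&b=2"
```

Inside the each() callback, the helper would be called as extractTargetUrl($node->attr('href')), next to $node->text() for the title.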

Upvotes: 1
