I'm playing around with Goutte and can't get it to connect to a certain website. All other URLs seem to be working perfectly, and I'm struggling to understand what's preventing it from connecting. It just hangs until it times out after 30 seconds. If I remove the timeout, the same happens after 150 seconds.
This is the controller I'm using:
<?php

namespace App\Http\Controllers;

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;

class ScraperController extends Controller
{
    public function scrape()
    {
        $goutteClient = new Client();

        // Only a browser-like user-agent is sent; no other headers are set.
        $goutteClient->setHeader('user-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36');

        $guzzleClient = new GuzzleClient([
            'timeout' => 30,   // seconds before the request is aborted
            'verify'  => true, // verify the server's TLS certificate
            'debug'   => true, // print cURL's verbose output
        ]);
        $goutteClient->setClient($guzzleClient);

        $crawler = $goutteClient->request('GET', 'https://www.tesco.com/');
        dump($crawler);

        /*$crawler->filter('.result__title .result__a')->each(function ($node) {
            dump($node->text());
        });*/
    }
}
This is the "debug" output, including the error:
* Trying 104.123.91.150:443...
* TCP_NODELAY set
* Connected to www.tesco.com (104.123.91.150) port 443 (#0)
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: C=GB; L=Welwyn Garden City; jurisdictionC=GB; O=Tesco PLC; businessCategory=Private Organization; serialNumber=00445790; CN=www.tesco.com
* start date: Feb 4 11:09:23 2020 GMT
* expire date: Feb 3 11:39:21 2022 GMT
* subjectAltName: host "www.tesco.com" matched cert's "www.tesco.com"
* issuer: C=US; O=Entrust, Inc.; OU=See www.entrust.net/legal-terms; OU=(c) 2014 Entrust, Inc. - for authorized use only; CN=Entrust Certification Authority - L1M
* SSL certificate verify ok.
> GET / HTTP/1.1
Host: www.tesco.com
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36
* old SSL session ID is stale, removing
* Operation timed out after 30001 milliseconds with 0 bytes received
* Closing connection 0
GuzzleHttp\Exception\ConnectException
cURL error 28: Operation timed out after 30001 milliseconds with 0 bytes received (see https://curl.haxx.se/libcurl/c/libcurl-errors.html)
http://localhost/scrape
Can anyone see why I'm getting no response at all?
Upvotes: 3
Views: 1577
I managed to resolve this by sending a fuller set of browser-like request headers. The working version also enables cookies and uses a shorter timeout:
<?php

namespace App\Http\Controllers;

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;

class ScraperController extends Controller
{
    public function scrape()
    {
        $goutteClient = new Client();

        // Send the headers a desktop Chrome browser would send.
        $goutteClient->setHeader('accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9');
        $goutteClient->setHeader('accept-encoding', 'gzip, deflate, br');
        $goutteClient->setHeader('accept-language', 'en-GB,en-US;q=0.9,en;q=0.8');
        $goutteClient->setHeader('upgrade-insecure-requests', '1');
        $goutteClient->setHeader('user-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36');
        $goutteClient->setHeader('connection', 'keep-alive');

        $guzzleClient = new GuzzleClient([
            'timeout' => 5,    // seconds before the request is aborted
            'verify'  => true, // verify the server's TLS certificate
            'debug'   => true, // print cURL's verbose output
            'cookies' => true, // keep cookies between requests made by this client
        ]);
        $goutteClient->setClient($guzzleClient);

        $crawler = $goutteClient->request('GET', 'https://www.tesco.com/');
        dump($crawler);

        /*$crawler->filter('.result__title .result__a')->each(function ($node) {
            dump($node->text());
        });*/
    }
}
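I didn't check which of the added headers actually made the difference. If you want to narrow it down, one option is to retry the request with each header left out in turn. Below is a minimal sketch of that idea, not something I ran against the site: `headerSetsWithOneRemoved` is a hypothetical helper of my own (it is not part of Goutte or Guzzle), and the loop only prints the variants instead of making requests.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: for each header name, build a copy of the full
 * header set with that one header left out.
 *
 * @param array<string, string> $headers
 * @return array<string, array<string, string>> keyed by the omitted header name
 */
function headerSetsWithOneRemoved(array $headers): array
{
    $variants = [];
    foreach (array_keys($headers) as $omitted) {
        $variant = $headers;
        unset($variant[$omitted]);
        $variants[$omitted] = $variant;
    }

    return $variants;
}

// The six headers set in the working controller.
$browserHeaders = [
    'accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding' => 'gzip, deflate, br',
    'accept-language' => 'en-GB,en-US;q=0.9,en;q=0.8',
    'upgrade-insecure-requests' => '1',
    'user-agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36',
    'connection' => 'keep-alive',
];

foreach (headerSetsWithOneRemoved($browserHeaders) as $omitted => $headers) {
    // In the controller you would build a fresh Goutte client here, call
    // setHeader($name, $value) for every entry in $headers, and request the page.
    // A timeout for a given $omitted would suggest that header is needed.
    printf("without %s: %d headers left\n", $omitted, count($headers));
}
```

Building a fresh client for each variant matters because headers set with `setHeader` stay on the client until removed, so reusing one client would mix the variants together.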
Upvotes: 2