I am trying to write an HTML parser with the help of Goutte, and it works very well. However, Goutte makes blocking requests. That is fine when dealing with a single service, but it becomes a problem when I want to query many services that are independent of each other. Goutte uses BrowserKit and Guzzle. I tried changing the doRequest function, but it failed with:
Argument 1 passed to Symfony\Component\BrowserKit\CookieJar::updateFromResponse() must be an instance of Symfony\Component\BrowserKit\Response
protected function doRequest($request)
{
    $headers = array();
    foreach ($request->getServer() as $key => $val) {
        $key = strtolower(str_replace('_', '-', $key));
        $contentHeaders = array('content-length' => true, 'content-md5' => true, 'content-type' => true);
        if (0 === strpos($key, 'http-')) {
            $headers[substr($key, 5)] = $val;
        }
        // CONTENT_* are not prefixed with HTTP_
        elseif (isset($contentHeaders[$key])) {
            $headers[$key] = $val;
        }
    }
    $cookies = CookieJar::fromArray(
        $this->getCookieJar()->allRawValues($request->getUri()),
        parse_url($request->getUri(), PHP_URL_HOST)
    );
    $requestOptions = array(
        'cookies' => $cookies,
        'allow_redirects' => false,
        'auth' => $this->auth,
    );
    if (!in_array($request->getMethod(), array('GET', 'HEAD'))) {
        if (null !== $content = $request->getContent()) {
            $requestOptions['body'] = $content;
        } else {
            if ($files = $request->getFiles()) {
                $requestOptions['multipart'] = [];
                $this->addPostFields($request->getParameters(), $requestOptions['multipart']);
                $this->addPostFiles($files, $requestOptions['multipart']);
            } else {
                $requestOptions['form_params'] = $request->getParameters();
            }
        }
    }
    if (!empty($headers)) {
        $requestOptions['headers'] = $headers;
    }
    $method = $request->getMethod();
    $uri = $request->getUri();
    foreach ($this->headers as $name => $value) {
        $requestOptions['headers'][$name] = $value;
    }
    // Let BrowserKit handle redirects
    $promise = $this->getClient()->requestAsync($method, $uri, $requestOptions);
    $promise->then(
        function (ResponseInterface $response) {
            return $this->createResponse($response);
        },
        function (RequestException $e) {
            $response = $e->getResponse();
            if (null === $response) {
                throw $e;
            }
        }
    );
    $promise->wait();
}
How can I change Goutte\Client.php so that it makes requests asynchronously? If that is not possible, how can I run my scrapers, which target different endpoints, simultaneously? Thanks
Upvotes: 1
Views: 1880
Reputation: 2047
Goutte is essentially a bridge between Guzzle and Symfony's BrowserKit and DomCrawler.
The biggest drawback of Goutte is that all requests are made synchronously.
To make requests asynchronously, you will have to forgo Goutte and use Guzzle and DomCrawler directly.
For example:
// $uri is assumed to be an array of the target URLs, indexed from 0.
$requests = [
    new GuzzleHttp\Psr7\Request('GET', $uri[0]),
    new GuzzleHttp\Psr7\Request('GET', $uri[1]),
    new GuzzleHttp\Psr7\Request('GET', $uri[2]),
    new GuzzleHttp\Psr7\Request('GET', $uri[3]),
    new GuzzleHttp\Psr7\Request('GET', $uri[4]),
    new GuzzleHttp\Psr7\Request('GET', $uri[5]),
    new GuzzleHttp\Psr7\Request('GET', $uri[6]),
];
$client = new GuzzleHttp\Client();
$pool = new GuzzleHttp\Pool($client, $requests, [
    'concurrency' => 5, // how many concurrent requests we want active at any given time
    // $uri must be imported with use(), otherwise it is undefined inside the closure
    'fulfilled' => function ($response, $index) use ($uri) {
        $crawler = new Symfony\Component\DomCrawler\Crawler(null, $uri[$index]);
        $crawler->addContent(
            (string) $response->getBody(),
            $response->getHeaderLine('Content-Type')
        );
        // work with $crawler here
    },
    'rejected' => function ($reason, $index) {
        // $reason is usually an exception; do something if the request failed.
    },
]);
// Start the transfers and block until every request has completed.
$promise = $pool->promise();
$promise->wait();
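The crawler built in the fulfilled callback above is discarded when the callback returns, so in practice you will want to store whatever you extract, keyed by the request index. Here is a minimal sketch of that, assuming Guzzle 7 and symfony/dom-crawler are installed through Composer. The URLs, the $titles and $errors arrays, and the choice to extract the page title are all made up for illustration. Guzzle's MockHandler stands in for the network so the sketch runs offline; remove the 'handler' option to send real requests.

```php
<?php
// Sketch only: assumes guzzlehttp/guzzle 7 and symfony/dom-crawler via Composer.
require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
use GuzzleHttp\Psr7\Response;
use Psr\Http\Message\ResponseInterface;
use Symfony\Component\DomCrawler\Crawler;

// Hypothetical endpoints; replace with the services you actually scrape.
$uris = ['http://a.example/', 'http://b.example/', 'http://c.example/'];

// Canned responses replace the network for this sketch.
$page = function (string $title): Response {
    $html = "<html><head><title>$title</title></head><body></body></html>";
    return new Response(200, ['Content-Type' => 'text/html'], $html);
};
$mock = new MockHandler([$page('Page A'), $page('Page B'), $page('Page C')]);
$client = new Client(['handler' => HandlerStack::create($mock)]);

// A generator lets the pool create requests lazily instead of all up front.
$requests = function (array $uris) {
    foreach ($uris as $uri) {
        yield new Request('GET', $uri);
    }
};

$titles = []; // extracted data, keyed by request index
$errors = []; // failure messages, keyed by request index

$pool = new Pool($client, $requests($uris), [
    'concurrency' => 2,
    'fulfilled' => function (ResponseInterface $response, $index) use ($uris, &$titles) {
        $crawler = new Crawler(null, $uris[$index]);
        $crawler->addHtmlContent((string) $response->getBody());
        // filterXPath() works without the optional symfony/css-selector package.
        $titles[$index] = $crawler->filterXPath('//title')->text();
    },
    'rejected' => function ($reason, $index) use (&$errors) {
        $errors[$index] = $reason instanceof \Throwable ? $reason->getMessage() : (string) $reason;
    },
]);

$pool->promise()->wait();

// Callbacks may fire in any order, so sort by index before using the results.
ksort($titles);
print_r($titles);
```

Because the results are keyed by $index, each entry in $titles can be matched back to the entry in $uris that produced it, regardless of the order in which the responses arrive.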
Upvotes: 1