user4170419

Async HTML parser with Goutte

I am trying to write an HTML parser with the help of Goutte. It works very well. However, Goutte uses blocking requests. That is fine when you are dealing with a single service, but it becomes a problem when I want to query lots of services that are independent of each other. Goutte uses BrowserKit and Guzzle. I have tried to change the doRequest function, but it failed with

Argument 1 passed to Symfony\Component\BrowserKit\CookieJar::updateFromResponse() must be an instance of Symfony\Component\BrowserKit\Response

    protected function doRequest($request)
    {
        $headers = array();
        foreach ($request->getServer() as $key => $val) {
            $key = strtolower(str_replace('_', '-', $key));
            $contentHeaders = array('content-length' => true, 'content-md5' => true, 'content-type' => true);
            if (0 === strpos($key, 'http-')) {
                $headers[substr($key, 5)] = $val;
            }
            // CONTENT_* are not prefixed with HTTP_
            elseif (isset($contentHeaders[$key])) {
                $headers[$key] = $val;
            }
        }

        $cookies = CookieJar::fromArray(
            $this->getCookieJar()->allRawValues($request->getUri()),
            parse_url($request->getUri(), PHP_URL_HOST)
        );

        $requestOptions = array(
            'cookies' => $cookies,
            'allow_redirects' => false,
            'auth' => $this->auth,
        );

        if (!in_array($request->getMethod(), array('GET', 'HEAD'))) {
            if (null !== $content = $request->getContent()) {
                $requestOptions['body'] = $content;
            } else {
                if ($files = $request->getFiles()) {
                    $requestOptions['multipart'] = [];

                    $this->addPostFields($request->getParameters(), $requestOptions['multipart']);
                    $this->addPostFiles($files, $requestOptions['multipart']);
                } else {
                    $requestOptions['form_params'] = $request->getParameters();
                }
            }
        }

        if (!empty($headers)) {
            $requestOptions['headers'] = $headers;
        }

        $method = $request->getMethod();
        $uri = $request->getUri();

        foreach ($this->headers as $name => $value) {
            $requestOptions['headers'][$name] = $value;
        }

        // Let BrowserKit handle redirects
        $promise = $this->getClient()->requestAsync($method, $uri, $requestOptions);
        $promise->then(
            function (ResponseInterface $response) {
                return $this->createResponse($response);
            },
            function (RequestException $e) {
                $response = $e->getResponse();
                if (null === $response) {
                    throw $e;
                }
            }
        );
        $promise->wait();
    }

How can I change Goutte\Client.php so that it makes requests asynchronously? If that is not possible, how can I run my scrapers, which target different endpoints, simultaneously? Thanks

Upvotes: 1

Views: 1880

Answers (1)

Shaun Bramley

Reputation: 2047

Goutte is essentially a bridge between Guzzle and Symfony's BrowserKit and DomCrawler.

The biggest drawback of using Goutte is that all requests are made synchronously.

To do things asynchronously, you will have to forgo Goutte and use Guzzle and DomCrawler directly.

For example:

// $uri is assumed to be an array holding the URLs to fetch.
$requests = [
    new GuzzleHttp\Psr7\Request('GET', $uri[0]),
    new GuzzleHttp\Psr7\Request('GET', $uri[1]),
    new GuzzleHttp\Psr7\Request('GET', $uri[2]),
    new GuzzleHttp\Psr7\Request('GET', $uri[3]),
    new GuzzleHttp\Psr7\Request('GET', $uri[4]),
    new GuzzleHttp\Psr7\Request('GET', $uri[5]),
    new GuzzleHttp\Psr7\Request('GET', $uri[6]),
];

$client = new GuzzleHttp\Client();

$pool = new GuzzleHttp\Pool($client, $requests, [
    'concurrency' => 5, // how many concurrent requests we want active at any given time
    // $uri has to be imported with use (...) to be visible inside the closure
    'fulfilled' => function ($response, $index) use ($uri) {
        $crawler = new Symfony\Component\DomCrawler\Crawler(null, $uri[$index]);
        $crawler->addContent(
            (string) $response->getBody(),
            // getHeaderLine() returns '' when the header is missing; pass null then
            $response->getHeaderLine('Content-Type') ?: null
        );
        // work with $crawler here
    },
    'rejected' => function ($reason, $index) {
        // do something if the request failed; $reason is usually an exception
    },
]);

// Start the transfers and block until every request has completed.
$promise = $pool->promise();
$promise->wait();
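
If you would rather get the crawlers back as a return value than handle them inside callbacks, the same idea can be sketched with plain promises. This is only a rough sketch under a few assumptions: Guzzle and DomCrawler are installed through Composer, guzzlehttp/promises is recent enough to ship GuzzleHttp\Promise\Utils (1.4 or later), and fetchCrawlers() is a hypothetical helper name of mine, not part of Goutte or Guzzle. Unlike the Pool, it starts all requests at once, with no concurrency limit.

```php
<?php
// Assumes a Composer project with guzzlehttp/guzzle and symfony/dom-crawler installed.
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Promise\Utils;
use Symfony\Component\DomCrawler\Crawler;

/**
 * Hypothetical helper: fetch all URIs concurrently and return one Crawler
 * per successful response, keyed like the input array.
 *
 * @param array<string|int, string> $uris
 * @return array<string|int, Crawler>
 */
function fetchCrawlers(Client $client, array $uris): array
{
    // Start every request without waiting for any of them.
    $promises = [];
    foreach ($uris as $key => $uri) {
        $promises[$key] = $client->getAsync($uri);
    }

    // settle() waits for all promises and does not throw when one is rejected.
    $results = Utils::settle($promises)->wait();

    $crawlers = [];
    foreach ($results as $key => $result) {
        if ($result['state'] !== 'fulfilled') {
            continue; // $result['reason'] holds the failure, usually an exception
        }
        $response = $result['value'];
        $crawler = new Crawler(null, $uris[$key]);
        $crawler->addContent(
            (string) $response->getBody(),
            $response->getHeaderLine('Content-Type') ?: null
        );
        $crawlers[$key] = $crawler;
    }

    return $crawlers;
}
```

You would call it as fetchCrawlers(new Client(), $uri) and then query each returned Crawler as usual. As far as I can tell, this also explains the error in the question: BrowserKit expects doRequest() to return a Response object, but the modified method returns nothing, so null ends up being passed to CookieJar::updateFromResponse().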

Upvotes: 1
