Alvin Sartor

Reputation: 2449

Scrapy-playwright: only the first page is scraped

I'm using Scrapy with scrapy-playwright (Python). When I run the spider, it successfully extracts the links from the first page and opens a Playwright page for each of them, but nothing happens with those pages: they never get scraped, and the spider just shuts down. Does anyone know why?

Here is the code:

import os
from typing import Any, Dict, Iterator, List
from urllib.parse import urlparse

from scrapy import Request
from scrapy.http import Response
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_playwright.page import PageMethod


class ClientSideSiteSpider(CrawlSpider):
    name = "client-side-site"
    handle_httpstatus_list = [301, 302, 401, 403, 404, 408, 429, 500, 503]
    exclude_patterns: List[str] = []

    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "ITEM_PIPELINES": {
            # more stuff...
        },
        "DOWNLOADER_MIDDLEWARES": {
            # more stuff...
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "proxy": {
                "server": os.environ.get("PROXY_TR_SERVER"),
                "username": os.environ.get("PROXY_TR_USER"),
                "password": os.environ.get("PROXY_TR_PASSWORD"),
            },
        }
    }

    playwright_meta = {
        "playwright": True,
        "playwright_include_page": True,
        "playwright_page_methods": [
            PageMethod("wait_for_timeout", 10000),
        ],
    }

    def __init__(
        self,
        start_url: str,
        # here there is some more stuff...,
        **kwargs: Any
    ):
        self.start_urls: List[str] = [start_url]
        # boring initializations removed...

        url_parsed = urlparse(start_url)
        allow_path = url_parsed.path
        self.rules = (
            Rule(
                LinkExtractor(allow=allow_path),
                callback="parse_item",
                follow=True,
            ),
        )

        super().__init__(**kwargs)

    def start_requests(self) -> Iterator[Request]:
        for url in self.start_urls:
            yield Request(url, meta=self.playwright_meta)

    def parse_start_url(self, response: Response) -> Dict[str, Any]:
        return self.parse_item(response)

    def parse_item(self, response: Response) -> Dict[str, Any]:
        return {
            "status": response.status,
            "file_urls": [response.url],
            "body": response._get_body(),
            "type": response.headers.get("Content-Type", ""),
            "latency": response.meta.get("download_latency"),
        }

    def process_request(self, request: Request, response: Response) -> Request:
        """Add the playwright meta to every request extracted by the rule... necessary?"""
        # NB: Scrapy only calls this hook if the Rule is built with
        # process_request="process_request"; it is not picked up by name.
        # Since Scrapy 2.0 the hook receives (request, response).
        request.meta.update(self.playwright_meta)
        return request
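
For reference, the scrapy-playwright docs say that with playwright_include_page=True the callback receives the page object in response.meta["playwright_page"] and is responsible for closing it. A minimal sketch of what a page-closing parse_item could look like (an assumption for illustration, not what the code above does; the async def form relies on the asyncio reactor configured above):

    # Sketch only: close the Playwright page that
    # playwright_include_page=True attaches to the response.
    async def parse_item(self, response: Response) -> Dict[str, Any]:
        page = response.meta["playwright_page"]
        await page.close()
        return {
            "status": response.status,
            "file_urls": [response.url],
            "body": response.body,
            "type": response.headers.get("Content-Type", ""),
            "latency": response.meta.get("download_latency"),
        }

As posted, parse_item never closes the included pages, so they accumulate in the browser context.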

In the logs I can see that the first page is successfully crawled (and all of its links are followed), but the subsequent pages aren't.

First page:

2022-05-12 14:28:14 [scrapy-playwright] DEBUG: Browser context started: 'default'
2022-05-12 14:28:14 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2022-05-12 14:28:14 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://discountcasino266.com/> (resource type: document, referrer: None)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://discountcasino266.com/> (referrer: None)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://discountcasino266.com/runtime-es2015.3896a8c3776f78100458.js> (resource type: script, referrer: https://discountcasino266.com/)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://discountcasino266.com/polyfills-es2015.77ed2742568a17467b11.js> (resource type: script, referrer: https://discountcasino266.com/)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://discountcasino266.com/main-es2015.33ff46eac5dca0dd9807.js> (resource type: script, referrer: https://discountcasino266.com/)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://discountcasino266.com/styles.d715a958203282df90b1.css> (resource type: stylesheet, referrer: https://discountcasino266.com/)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://discountcasino266.com/polyfills-es2015.77ed2742568a17467b11.js> (referrer: None)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://discountcasino266.com/main-es2015.33ff46eac5dca0dd9807.js> (referrer: None)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://discountcasino266.com/runtime-es2015.3896a8c3776f78100458.js> (referrer: None)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://discountcasino266.com/6051-es2015.0d363775a5eb43bd3a29.js> (resource type: script, referrer: https://discountcasino266.com/)
....

Following pages:

2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 2 (2 for all contexts)
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 3 (3 for all contexts)
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 4 (4 for all contexts)
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 5 (5 for all contexts)
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 6 (6 for all contexts)
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 7 (7 for all contexts)
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 8 (8 for all contexts)
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 9 (9 for all contexts)
2022-05-12 14:28:18 [scrapy.core.engine] INFO: Closing spider (finished)

Upvotes: 2

Views: 1156

Answers (1)

Gogicool

Reputation: 171

Try adding callback=self.parse_start_url to the requests yielded in start_requests, like this:

def start_requests(self) -> Iterator[Request]:
    for url in self.start_urls:
        yield Request(
            url, 
            callback=self.parse_start_url,
            meta=self.playwright_meta
        )
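
(Request falls back to the spider's parse method when no callback is given, so this pins the start responses to parse_start_url explicitly.)

It may also be worth checking that process_request is actually wired into the Rule: Scrapy only calls that hook when the Rule is given it explicitly, as a callable or a method name, so as posted the followed requests may never receive the playwright meta. A sketch based on the Rule from the question:

self.rules = (
    Rule(
        LinkExtractor(allow=allow_path),
        callback="parse_item",
        follow=True,
        # without this, the spider's process_request hook is never called
        process_request="process_request",
    ),
)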

Upvotes: 0
