Reputation: 2449
I'm using Scrapy with scrapy-playwright (Python). When I scrape a site, the spider successfully extracts the links from the first page and creates new Playwright pages for them, but those pages never get scraped; the spider just shuts down. Does anyone know why?
Here is the code:
class ClientSideSiteSpider(CrawlSpider):
    name = "client-side-site"
    handle_httpstatus_list = [301, 302, 401, 403, 404, 408, 429, 500, 503]
    exclude_patterns: List[str] = []
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "ITEM_PIPELINES": {
            # more stuff...
        },
        "DOWNLOADER_MIDDLEWARES": {
            # more stuff...
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "proxy": {
                "server": os.environ.get("PROXY_TR_SERVER"),
                "username": os.environ.get("PROXY_TR_USER"),
                "password": os.environ.get("PROXY_TR_PASSWORD"),
            },
        }
    }
    playwright_meta = {
        "playwright": True,
        "playwright_include_page": True,
        "playwright_page_methods": [
            PageMethod("wait_for_timeout", 10000),
        ],
    }

    def __init__(
        self,
        start_url: str,
        # here there is some more stuff...,
        **kwargs: Any
    ):
        self.start_urls: List[str] = [start_url]
        # boring initializations removed...
        url_parsed = urlparse(start_url)
        allow_path = url_parsed.path
        self.rules = (
            Rule(
                LinkExtractor(allow=allow_path),
                callback="parse_item",
                follow=True,
            ),
        )
        super().__init__(**kwargs)

    def start_requests(self) -> Iterator[Request]:
        for url in self.start_urls:
            yield Request(url, meta=self.playwright_meta)

    def parse_start_url(self, response: Response) -> Dict[str, Any]:
        return self.parse_item(response)

    def parse_item(self, response: Response) -> Dict[str, Any]:
        return {
            "status": response.status,
            "file_urls": [response.url],
            "body": response._get_body(),
            "type": response.headers.get("Content-Type", ""),
            "latency": response.meta.get("download_latency"),
        }

    def process_request(self, request: Request):
        """ adding playwright headers to all requests... necessary? """
        request.meta.update(self.playwright_meta)
        return request
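A side note on the rule above: as far as I understand, LinkExtractor treats allow as a regular expression that is searched against each absolute URL, so when the start URL's path is just "/" the pattern matches practically any link. The following is a minimal stdlib sketch of that matching only, assuming re.search semantics; allowed_by_path is a hypothetical helper, the example.com URLs are made up for illustration, and the real extractor applies further filters (domains, file extensions) that are not modelled here.

```python
import re
from urllib.parse import urlparse


def allowed_by_path(start_url: str, candidate_url: str) -> bool:
    """Approximate LinkExtractor(allow=<path of start_url>) with a regex search.

    Hypothetical helper: it only mimics the `allow` check, nothing else.
    """
    allow_path = urlparse(start_url).path
    # The path is used as an unanchored regex, so "/" matches any absolute URL.
    return re.search(allow_path, candidate_url) is not None


# Start URL from the logs: the path is "/", so a deeper link on the site passes.
print(allowed_by_path("https://discountcasino266.com/",
                      "https://discountcasino266.com/en/slots"))

# Assumed example: a narrower path rejects links outside that section.
print(allowed_by_path("https://example.com/blog",
                      "https://example.com/shop/item"))
```

If the goal is to stay inside one section of a site, anchoring the pattern or passing a more specific path would narrow the match.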
In the logs I see that the first page is successfully crawled (and all its links are followed), but the following ones aren't.
First page:
2022-05-12 14:28:14 [scrapy-playwright] DEBUG: Browser context started: 'default'
2022-05-12 14:28:14 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2022-05-12 14:28:14 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://discountcasino266.com/> (resource type: document, referrer: None)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://discountcasino266.com/> (referrer: None)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://discountcasino266.com/runtime-es2015.3896a8c3776f78100458.js> (resource type: script, referrer: https://discountcasino266.com/)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://discountcasino266.com/polyfills-es2015.77ed2742568a17467b11.js> (resource type: script, referrer: https://discountcasino266.com/)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://discountcasino266.com/main-es2015.33ff46eac5dca0dd9807.js> (resource type: script, referrer: https://discountcasino266.com/)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://discountcasino266.com/styles.d715a958203282df90b1.css> (resource type: stylesheet, referrer: https://discountcasino266.com/)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://discountcasino266.com/polyfills-es2015.77ed2742568a17467b11.js> (referrer: None)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://discountcasino266.com/main-es2015.33ff46eac5dca0dd9807.js> (referrer: None)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://discountcasino266.com/runtime-es2015.3896a8c3776f78100458.js> (referrer: None)
2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://discountcasino266.com/6051-es2015.0d363775a5eb43bd3a29.js> (resource type: script, referrer: https://discountcasino266.com/)
....
Following pages:
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 2 (2 for all contexts)
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 3 (3 for all contexts)
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 4 (4 for all contexts)
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 5 (5 for all contexts)
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 6 (6 for all contexts)
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 7 (7 for all contexts)
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 8 (8 for all contexts)
2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 9 (9 for all contexts)
2022-05-12 14:28:18 [scrapy.core.engine] INFO: Closing spider (finished)
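To check this beyond eyeballing the output, the scrapy-playwright lines can be tallied by event type. This is a minimal sketch, assuming the log format shown above; count_playwright_events and EVENT_RE are hypothetical names and not part of scrapy-playwright.

```python
import re
from collections import Counter
from typing import Iterable

# Captures the event name that follows "[Context=...]" in scrapy-playwright DEBUG lines.
EVENT_RE = re.compile(
    r"\[scrapy-playwright\] DEBUG: \[Context=[^\]]+\] "
    r"(New page created|Request|Response)"
)


def count_playwright_events(lines: Iterable[str]) -> Counter:
    """Count page-creation, request and response events in log lines."""
    counts: Counter = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# A few lines copied from the log above.
sample = [
    "2022-05-12 14:28:14 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)",
    "2022-05-12 14:28:14 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://discountcasino266.com/> (resource type: document, referrer: None)",
    "2022-05-12 14:28:15 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://discountcasino266.com/> (referrer: None)",
    "2022-05-12 14:28:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 2 (2 for all contexts)",
]
print(count_playwright_events(sample))
```

If the "New page created" count keeps growing while "Request" stays flat after the first page, that would suggest the extra pages are opened but never navigate anywhere.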
Upvotes: 2
Views: 1156
Reputation: 171
Try adding callback=self.parse_start_url to the Request in start_requests, like this:
def start_requests(self) -> Iterator[Request]:
    for url in self.start_urls:
        yield Request(
            url,
            callback=self.parse_start_url,
            meta=self.playwright_meta
        )
Upvotes: 0