Rishav Dhariwal

Reputation: 33

Which resource types should I filter out so that my scrapy-playwright scraper avoids unnecessary requests?

Description

I am trying to scrape https://slickdeals.net/computer-deals/ to get the name, price, and discounted price of the products listed there.

When I run a script with scrapy crawl to scrape just the first page of the site, the number of requests shown in the console is enormous. I understand that the Playwright middleware acts like a browser and makes extra requests for images and other assets in order to render the page in full detail, but even allowing for that, the number of GET requests seems excessive. It makes debugging really hard and it slows down my scraper.

I want to understand what these requests are and why they are being made. How can I limit them to the bare minimum? I know about the page.route method, but I don't understand which resource types I can filter out while still getting the data my scraper needs.
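For concreteness, here is a minimal sketch of the kind of filter in question. It uses scrapy-playwright's PLAYWRIGHT_ABORT_REQUEST setting, which takes a predicate that receives each Playwright request and returns True to abort it. The set of blocked resource types and the list of blocked hosts are assumptions, not a confirmed answer: the hosts are ad and analytics domains that appear in the log, and the types are ones that are usually not needed to read text from the DOM. The names BLOCKED_RESOURCE_TYPES, BLOCKED_HOST_SUFFIXES, and should_abort_request are hypothetical.

```python
from types import SimpleNamespace
from urllib.parse import urlparse

# Assumption: these types are not needed to extract text from the DOM.
# "document", "script", "fetch" and "xhr" are left alone because the page
# may build its deal cards with JavaScript.
BLOCKED_RESOURCE_TYPES = {"image", "font", "media", "stylesheet"}

# Assumption: third-party ad/analytics hosts seen in the log are not needed.
BLOCKED_HOST_SUFFIXES = (
    "doubleclick.net",
    "amazon-adsystem.com",
    "googletagmanager.com",
    "btloader.com",
)


def should_abort_request(request):
    """Return True if the browser request should be aborted."""
    if request.resource_type in BLOCKED_RESOURCE_TYPES:
        return True
    host = urlparse(request.url).hostname or ""
    return any(
        host == suffix or host.endswith("." + suffix)
        for suffix in BLOCKED_HOST_SUFFIXES
    )


# In settings.py (or in the spider's custom_settings):
PLAYWRIGHT_ABORT_REQUEST = should_abort_request

# Offline check with stand-in objects that mimic a Playwright request.
ad_script = SimpleNamespace(
    resource_type="script",
    url="https://securepubads.g.doubleclick.net/tag/js/gpt.js",
)
main_page = SimpleNamespace(
    resource_type="document",
    url="https://slickdeals.net/computer-deals/",
)
print(should_abort_request(ad_script), should_abort_request(main_page))
```

Blocking images and stylesheets would also make the full-page screenshot look unstyled, so the screenshot step would probably need to be dropped or the filter relaxed while debugging. Whether the item count stays the same after filtering would have to be checked against a run without the filter.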

Steps to Reproduce

  1. Ran scrapy crawl product_playwright with the following code in my spider:
import scrapy
from scrapy_playwright.page import PageMethod


class ProductPlaywrightSpider(scrapy.Spider):
    name = "product_playwright"
    def start_requests(self):
        url = "https://slickdeals.net/computer-deals/"
        yield scrapy.Request(
            url,
            callback=self.parse,
            # errback is a Request argument; placed inside meta it is never called
            errback=self.errback,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
            ),
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        products = response.xpath('//li[@class="bp-p-blueberryDealCard bp-p-filterGrid_item bp-p-dealCard bp-c-card"]')
        screenshot = await page.screenshot(
            path='D:\\Desktop\\Company\\PythonProject\\slickdeals\\pic_of_courses\\page1.png',
            full_page=True)
        await page.close()
        for product in products:
            yield dict(
                name = product.xpath(".//a[@class='bp-c-card_title bp-c-link']/text()").get(),
                discounted_price = product.xpath('.//descendant::span[@class="bp-p-dealCard_price"]/text()').get(),
                original_price = product.xpath('.//descendant::span[@class="bp-p-dealCard_originalPrice"]/text()').get(),
                name_of_store = product.xpath('.//descendant::span[@class="bp-c-card_subtitle"]/text()').get()
            )


    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()

Expected behavior:

I am not sure how many requests I should expect here.

Actual behavior: (I had to trim the log a lot, because the post came to 354000 characters and exceeded the limit)

scrapy crawl product_playwright
2025-02-09 12:03:47 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: slickdeals)
2025-02-09 12:03:47 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.11.7, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.11.0, Python 3.12.8 (tags/v3.12.8:2dc476b, Dec  3 2024, 19:30:04) [MSC v.1942 64 bit (AMD64)], pyOpenSSL 24.3.0 (OpenSSL 3.4.0 22 Oct 2024), cryptography 44.0.0, Platform Windows-11-10.0.26100-SP0
2025-02-09 12:03:47 [scrapy.addons] INFO: Enabled addons:
[]
2025-02-09 12:03:47 [asyncio] DEBUG: Using selector: SelectSelector
2025-02-09 12:03:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2025-02-09 12:03:47 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2025-02-09 12:03:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2025-02-09 12:03:47 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2025-02-09 12:03:47 [scrapy.extensions.telnet] INFO: Telnet Password: 8a6d05924bf76441
2025-02-09 12:03:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2025-02-09 12:03:47 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'slickdeals',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'slickdeals.spiders',
 'SPIDER_MODULES': ['slickdeals.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2025-02-09 12:03:47 [asyncio] DEBUG: Using proactor: IocpProactor
2025-02-09 12:03:47 [scrapy-playwright] INFO: Started loop on separate thread: <ProactorEventLoop running=True closed=False debug=False>
2025-02-09 12:03:48 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2025-02-09 12:03:48 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2025-02-09 12:03:48 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2025-02-09 12:03:48 [scrapy.core.engine] INFO: Spider opened
2025-02-09 12:03:48 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2025-02-09 12:03:48 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2025-02-09 12:03:48 [scrapy-playwright] INFO: Starting download handler
2025-02-09 12:03:48 [scrapy-playwright] INFO: Starting download handler
2025-02-09 12:03:53 [scrapy-playwright] INFO: Launching browser chromium
2025-02-09 12:03:53 [scrapy-playwright] INFO: Browser chromium launched
2025-02-09 12:03:53 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2025-02-09 12:03:55 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2025-02-09 12:03:55 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://slickdeals.net/computer-deals/> (resource type: document)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://slickdeals.net/computer-deals/>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://slickdeals.net/ajax/bSubNavPlacement.php?section=app.php&url=%2Fcomputer-deals%2F> (resource type: other, referrer: https://slickdeals.net/computer-deals/)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://fonts.googleapis.com/css2?family=Blinker:wght@400;600;700&display=swap> (resource type: stylesheet, referrer: https://slickdeals.net/)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://slickdeals.net/min/9310/g=css&style=14&n=global-critical-desktop%2Cglobal-desktop%2Clegacy-global-desktop%2Cjqueryui%2Ccomponents> (resource type: stylesheet, referrer: https://slickdeals.net/computer-deals/)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://slickdeals.net/image-pool/siteFooter/[email protected]> (resource type: image, referrer: https://slickdeals.net/computer-deals/)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://slickdeals.net/image-pool/siteFooter/[email protected]> (resource type: image, referrer: https://slickdeals.net/computer-deals/)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://slickdeals.net/build/js/categoryPage.es.00330623.js> (resource type: script, referrer: https://slickdeals.net/computer-deals/)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://slickdeals.net/image-pool/siteFooter/[email protected]> (resource type: image, referrer: https://slickdeals.net/computer-deals/)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://slickdeals.net/image-pool/siteFooter/[email protected]>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://slickdeals.net/image-pool/siteFooter/[email protected]>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://static.slickdealscdn.com/mfe-static/dist/components/public/vuerango.browser.2e0e5c307a4e1b12459ebc95cb41237cc62c717b960391ec21ae6b3a2d3df526.js> (resource type: script, referrer: https://slickdeals.net/)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://slickdeals.net/build/js/categoryPage.es.00330623.js>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://slickdeals.net/image-pool/siteFooter/[email protected]>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://fonts.googleapis.com/css2?family=Blinker:wght@400;600;700&display=swap>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://static.slickdealscdn.com/mfe-static/dist/components/public/vuerango.browser.2e0e5c307a4e1b12459ebc95cb41237cc62c717b960391ec21ae6b3a2d3df526.js>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://slickdeals.net/ajax/bSubNavPlacement.php?section=app.php&url=%2Fcomputer-deals%2F>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://slickdeals.net/scripts/bundles/category-detail-redesign.js?9310> (resource type: script, referrer: https://slickdeals.net/computer-deals/)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://slickdeals.net/scripts/bundles/category-detail-redesign.js?9310>
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://slickdeals-net.videoplayerhub.com/videoloader.js> (resource type: script, referrer: https://slickdeals.net/)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Response: <301 https://slickdeals-net.videoplayerhub.com/videoloader.js> (location: https://btloader.com/tag?h=slickdeals-net&upapi=true)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://slickdeals.net/min/9310/g=css&style=14&n=global-critical-desktop%2Cglobal-desktop%2Clegacy-global-desktop%2Cjqueryui%2Ccomponents>
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://btloader.com/tag?h=slickdeals-net&upapi=true> (resource type: script, referrer: https://slickdeals.net/)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://btloader.com/tag?h=slickdeals-net&upapi=true>
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://securepubads.g.doubleclick.net/tag/js/gpt.js> (resource type: script, referrer: https://slickdeals.net/)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://c.amazon-adsystem.com/aax2/apstag.js> (resource type: script, referrer: https://slickdeals.net/)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://slickdeals.net/scripts/providerV9.js?1> (resource type: script, referrer: https://slickdeals.net/computer-deals/)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.googletagmanager.com/gtm.js?id=GTM-5XP5PSM&l=gtmDl> (resource type: script, referrer: https://slickdeals.net/)
2025-02-09 13:13:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 282,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 998621,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 30.030142,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2025, 2, 9, 7, 43, 15, 263535, tzinfo=datetime.timezone.utc),
 'item_scraped_count': 40,
 'items_per_minute': None,
 'log_count/DEBUG': 637,
 'log_count/INFO': 17,
 'playwright/browser_count': 1,
 'playwright/context_count': 1,
 'playwright/context_count/max_concurrent': 1,
 'playwright/context_count/persistent/False': 1,
 'playwright/context_count/remote/False': 1,
 'playwright/page_count': 1,
 'playwright/page_count/max_concurrent': 1,
 'playwright/request_count': 322,
 'playwright/request_count/method/GET': 306,
 'playwright/request_count/method/HEAD': 1,
 'playwright/request_count/method/POST': 15,
 'playwright/request_count/navigation': 33,
 'playwright/request_count/resource_type/document': 33,
 'playwright/request_count/resource_type/fetch': 48,
 'playwright/request_count/resource_type/font': 6,
 'playwright/request_count/resource_type/image': 157,
 'playwright/request_count/resource_type/other': 2,
 'playwright/request_count/resource_type/script': 62,
 'playwright/request_count/resource_type/stylesheet': 2,
 'playwright/request_count/resource_type/xhr': 12,
 'playwright/response_count': 266,
 'playwright/response_count/method/GET': 252,
 'playwright/response_count/method/HEAD': 1,
 'playwright/response_count/method/POST': 13,
 'playwright/response_count/resource_type/document': 32,
 'playwright/response_count/resource_type/fetch': 45,
 'playwright/response_count/resource_type/font': 6,
 'playwright/response_count/resource_type/image': 121,
 'playwright/response_count/resource_type/other': 2,
 'playwright/response_count/resource_type/script': 47,
 'playwright/response_count/resource_type/stylesheet': 2,
 'playwright/response_count/resource_type/xhr': 11,
 'response_received_count': 1,
 'responses_per_minute': None,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2025, 2, 9, 7, 42, 45, 233393, tzinfo=datetime.timezone.utc)}
2025-02-09 13:13:15 [scrapy.core.engine] INFO: Spider closed (finished)
2025-02-09 13:13:15 [scrapy-playwright] INFO: Closing download handler
2025-02-09 13:13:16 [scrapy-playwright] INFO: Closing download handler
2025-02-09 13:13:16 [scrapy-playwright] DEBUG: Browser context closed: 'default' (persistent=False, remote=False)
2025-02-09 13:13:16 [scrapy-playwright] INFO: Closing browser
2025-02-09 13:13:16 [scrapy-playwright] DEBUG: Browser disconnected
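To see which kinds of requests dominate, the per-type counters in the stats dump above can be turned into a quick breakdown. This is only a sketch: the numbers are copied from the playwright/request_count/resource_type/* entries of this one run, and the variable names are made up for illustration.

```python
# Counters copied from the "Dumping Scrapy stats" output of this run.
stats = {
    "playwright/request_count": 322,
    "playwright/request_count/resource_type/document": 33,
    "playwright/request_count/resource_type/fetch": 48,
    "playwright/request_count/resource_type/font": 6,
    "playwright/request_count/resource_type/image": 157,
    "playwright/request_count/resource_type/other": 2,
    "playwright/request_count/resource_type/script": 62,
    "playwright/request_count/resource_type/stylesheet": 2,
    "playwright/request_count/resource_type/xhr": 12,
}

prefix = "playwright/request_count/resource_type/"

# Keep only the per-type counters and strip the common prefix.
by_type = {
    key[len(prefix):]: count
    for key, count in stats.items()
    if key.startswith(prefix)
}
total = sum(by_type.values())

# Print the types from most to least frequent with their share of the total.
for name, count in sorted(by_type.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<12}{count:>5}{count / total:>8.1%}")
```

In this run images account for 157 of the 322 browser requests and scripts for another 62, so those two types are the first candidates to look at when deciding what to filter.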

Reproduces how often: every time

Versions (I have added the playwright and scrapy-playwright versions too)

scrapy version --verbose
Scrapy       : 2.12.0
playwright   : 1.49.1
scrapy-playwright : 0.0.42
lxml         : 5.3.0.0
libxml2      : 2.11.7
cssselect    : 1.2.0
parsel       : 1.10.0
w3lib        : 2.2.1
Twisted      : 24.11.0
Python       : 3.12.8 (tags/v3.12.8:2dc476b, Dec  3 2024, 19:30:04) [MSC v.1942 64 bit (AMD64)]
pyOpenSSL    : 25.0.0 (OpenSSL 3.4.0 22 Oct 2024)
cryptography : 44.0.0
Platform     : Windows-11-10.0.26100-SP0

Upvotes: 0

Views: 43

Answers (0)
