Didi

Reputation: 461

ModuleNotFoundError: No module named 'scrapy_playwright.middleware'

I am trying out Scrapy Playwright for the first time and following examples online. I am getting the error "ModuleNotFoundError: No module named 'scrapy_playwright.middleware'" when I run the command scrapy crawl test_graph.

This is the project directory:

website_crawler/
├── scrapy.cfg
├── requirements.txt
├── venv/
├── test_crawler/
│   ├── settings.py
│   ├── spiders/
│   │   ├── __init__.py
│   │   ├── graph_spider.py

I get this error:


2024-12-13 07:12:41 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: scrapybot)
2024-12-13 07:12:41 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.12.9, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.11.0, Python 3.11.3 (v3.11.3:f3909b8bc8, Apr  4 2023, 20:12:10) [Clang 13.0.0 (clang-1300.0.29.30)], pyOpenSSL 24.3.0 (OpenSSL 3.4.0 22 Oct 2024), cryptography 44.0.0, Platform macOS-15.1.1-arm64-arm-64bit
2024-12-13 07:12:41 [scrapy.addons] INFO: Enabled addons:
[]
2024-12-13 07:12:41 [asyncio] DEBUG: Using selector: KqueueSelector
2024-12-13 07:12:41 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-12-13 07:12:41 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-12-13 07:12:41 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-12-13 07:12:41 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-12-13 07:12:41 [scrapy.extensions.telnet] INFO: Telnet Password: aa73a5a9e8c674e6
2024-12-13 07:12:41 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2024-12-13 07:12:41 [scrapy.crawler] INFO: Overridden settings:
{'NEWSPIDER_MODULE': 'test_crawler.spiders',
 'SPIDER_MODULES': ['test_crawler.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
Unhandled error in Deferred:
2024-12-13 07:12:41 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
  File "/Users/ck/website_crawler/venv/lib/python3.11/site-packages/twisted/internet/defer.py", line 2017, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "/Users/ck/website_crawler/venv/lib/python3.11/site-packages/scrapy/crawler.py", line 152, in crawl
    self.engine = self._create_engine()
  File "/Users/ck/website_crawler/venv/lib/python3.11/site-packages/scrapy/crawler.py", line 166, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/Users/ck/website_crawler/venv/lib/python3.11/site-packages/scrapy/core/engine.py", line 101, in __init__
    self.downloader: Downloader = downloader_cls(crawler)
  File "/Users/ck/website_crawler/venv/lib/python3.11/site-packages/scrapy/core/downloader/__init__.py", line 109, in __init__
    DownloaderMiddlewareManager.from_crawler(crawler)
  File "/Users/ck/website_crawler/venv/lib/python3.11/site-packages/scrapy/middleware.py", line 77, in from_crawler
    return cls._from_settings(crawler.settings, crawler)
  File "/Users/ck/website_crawler/venv/lib/python3.11/site-packages/scrapy/middleware.py", line 86, in _from_settings
    mwcls = load_object(clspath)
  File "/Users/ck/website_crawler/venv/lib/python3.11/site-packages/scrapy/utils/misc.py", line 71, in load_object
    mod = import_module(module)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
    
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
    
  File "<frozen importlib._bootstrap>", line 1142, in _find_and_load_unlocked
    
builtins.ModuleNotFoundError: No module named '

This is my relevant code:

settings.py

DOWNLOADER_MIDDLEWARES = {
    'scrapy_playwright.middleware.ScrapyPlaywrightDownloadHandler': 543,
}

DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

SPIDER_MODULES = ['test_crawler.spiders']
NEWSPIDER_MODULE = 'test_crawler.spiders'

PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}

scrapy.cfg

[settings]
default = test_crawler.settings

graph_spider.py

import scrapy
from scrapy_playwright.page import PageMethod
import networkx as nx


class TestGraphSpider(scrapy.Spider):
    name = "test_graph"
    start_urls = ["https://cats.com/"]
    graph = nx.DiGraph() 
    visited = set()  # Track visited URLs to avoid duplicates

    custom_settings = {
        "PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": True},
    }

    def start_requests(self):
        """Start crawling from the homepage."""
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={
                    "playwright": True,
                    "playwright_page_methods": [
                        PageMethod("wait_for_selector", "body"),
                    ],
                },
                callback=self.parse,
            )

    def parse(self, response):
        """Parse a page and extract links."""

        url = response.url
        if url in self.visited:
            return
        self.visited.add(url)

        title = response.css("title::text").get(default="").strip()
        self.graph.add_node(url, title=title, content=response.text[:500])

        for link in response.css("a::attr(href)").getall():
            full_url = response.urljoin(link)
            if self.is_internal_link(full_url):  
                self.graph.add_edge(url, full_url)  
                yield scrapy.Request(
                    full_url,
                    meta={"playwright": True},
                    callback=self.parse,
                )

    def is_internal_link(self, url):
        """Keep only same-site links, skipping common static assets."""
        return url.startswith("https://cats.com/") and not url.endswith((".pdf", ".jpg", ".png", ".css", ".js"))

    def closed(self, reason):
        """Export the graph when the spider finishes."""
        nx.write_gexf(self.graph, "test_graph.gexf")
        self.logger.info("Graph exported to test_graph.gexf")

In my venv/bin directory, I can see python, python3, and python3.11.

The command:

scrapy list

shows 'test_graph'.

Upvotes: 1

Views: 163

Answers (1)

Georgiy

Reputation: 3561


DOWNLOADER_MIDDLEWARES = {
    'scrapy_playwright.middleware.ScrapyPlaywrightDownloadHandler': 543,
} 

ScrapyPlaywrightDownloadHandler is not a downloader middleware, so it can't be enabled as one.
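For context, the two settings hook into Scrapy at different points. A rough sketch of the two interfaces (ExampleDownloaderMiddleware and ExampleDownloadHandler are hypothetical names, not classes from scrapy-playwright):

# Illustrative only: two different extension points in Scrapy.

class ExampleDownloaderMiddleware:
    """Registered in DOWNLOADER_MIDDLEWARES: hooks around requests/responses."""
    def process_request(self, request, spider):
        return None  # None means "continue processing this request normally"

class ExampleDownloadHandler:
    """Registered in DOWNLOAD_HANDLERS (per URL scheme): performs the download itself."""
    def download_request(self, request, spider):
        ...  # returns a Deferred that fires with a Response

Because ScrapyPlaywrightDownloadHandler is the second kind of object, listing it under DOWNLOADER_MIDDLEWARES makes Scrapy try to import it from a scrapy_playwright.middleware module that doesn't exist, which is exactly the error in the traceback.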

According to the documentation at https://github.com/scrapy-plugins/scrapy-playwright, enabling it requires updating the download handlers and the Twisted reactor, not adding a middleware entry, as in the sketch below.
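A minimal sketch of the corrected settings.py, assuming no customizations beyond what the question shows (the DOWNLOADER_MIDDLEWARES entry is simply dropped):

# settings.py -- corrected sketch.
# No DOWNLOADER_MIDDLEWARES entry: scrapy-playwright is wired in
# through download handlers, not middleware.

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

With the middleware entry removed, Scrapy no longer tries to import scrapy_playwright.middleware and the crawl should start.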

Upvotes: 1
