Reputation: 461
I am trying out Scrapy Playwright for the first time and following examples online. I am getting the error "ModuleNotFoundError: No module named 'scrapy_playwright.middleware'" when I run the command scrapy crawl test_graph.
This is the Project directory:
website_crawler/
├── scrapy.cfg
├── requirements.txt
├── venv/
├── test_crawler/
│ ├── settings.py
│ ├── spiders/
│ │ ├── __init__.py
│ │ ├── graph_spider.py
I get this error:
2024-12-13 07:12:41 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: scrapybot)
2024-12-13 07:12:41 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.12.9, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.11.0, Python 3.11.3 (v3.11.3:f3909b8bc8, Apr 4 2023, 20:12:10) [Clang 13.0.0 (clang-1300.0.29.30)], pyOpenSSL 24.3.0 (OpenSSL 3.4.0 22 Oct 2024), cryptography 44.0.0, Platform macOS-15.1.1-arm64-arm-64bit
2024-12-13 07:12:41 [scrapy.addons] INFO: Enabled addons:
[]
2024-12-13 07:12:41 [asyncio] DEBUG: Using selector: KqueueSelector
2024-12-13 07:12:41 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-12-13 07:12:41 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-12-13 07:12:41 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-12-13 07:12:41 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-12-13 07:12:41 [scrapy.extensions.telnet] INFO: Telnet Password: aa73a5a9e8c674e6
2024-12-13 07:12:41 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2024-12-13 07:12:41 [scrapy.crawler] INFO: Overridden settings:
{'NEWSPIDER_MODULE': 'test_crawler.spiders',
'SPIDER_MODULES': ['test_crawler.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
Unhandled error in Deferred:
2024-12-13 07:12:41 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
File "/Users/ck/website_crawler/venv/lib/python3.11/site-packages/twisted/internet/defer.py", line 2017, in _inlineCallbacks
result = context.run(gen.send, result)
File "/Users/ck/website_crawler/venv/lib/python3.11/site-packages/scrapy/crawler.py", line 152, in crawl
self.engine = self._create_engine()
File "/Users/ck/website_crawler/venv/lib/python3.11/site-packages/scrapy/crawler.py", line 166, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/Users/ck/website_crawler/venv/lib/python3.11/site-packages/scrapy/core/engine.py", line 101, in __init__
self.downloader: Downloader = downloader_cls(crawler)
File "/Users/ck/website_crawler/venv/lib/python3.11/site-packages/scrapy/core/downloader/__init__.py", line 109, in __init__
DownloaderMiddlewareManager.from_crawler(crawler)
File "/Users/ck/website_crawler/venv/lib/python3.11/site-packages/scrapy/middleware.py", line 77, in from_crawler
return cls._from_settings(crawler.settings, crawler)
File "/Users/ck/website_crawler/venv/lib/python3.11/site-packages/scrapy/middleware.py", line 86, in _from_settings
mwcls = load_object(clspath)
File "/Users/ck/website_crawler/venv/lib/python3.11/site-packages/scrapy/utils/misc.py", line 71, in load_object
mod = import_module(module)
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
File "<frozen importlib._bootstrap>", line 1142, in _find_and_load_unlocked
builtins.ModuleNotFoundError: No module named '
This is my relevant code:
settings.py
DOWNLOADER_MIDDLEWARES = {
'scrapy_playwright.middleware.ScrapyPlaywrightDownloadHandler': 543,
}
DOWNLOAD_HANDLERS = {
'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
SPIDER_MODULES = ['test_crawler.spiders']
NEWSPIDER_MODULE = 'test_crawler.spiders'
PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}
scrapy.cfg
[settings]
default = test_crawler.settings
graph_spider.py
import scrapy
from scrapy_playwright.page import PageMethod
import networkx as nx
class TestGraphSpider(scrapy.Spider):
name = "test_graph"
start_urls = ["https://cats.com/"]
graph = nx.DiGraph()
visited = set() # Track visited URLs to avoid duplicates
custom_settings = {
"PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": True},
}
def start_requests(self):
"""Start crawling from the homepage."""
for url in self.start_urls:
yield scrapy.Request(
url,
meta={
"playwright": True,
"playwright_page_methods": [
PageMethod("wait_for_selector", "body"),
],
},
callback=self.parse,
)
def parse(self, response):
"""Parse a page and extract links."""
url = response.url
if url in self.visited:
return
self.visited.add(url)
title = response.css("title::text").get(default="").strip()
self.graph.add_node(url, title=title, content=response.text[:500])
for link in response.css("a::attr(href)").getall():
full_url = response.urljoin(link)
if self.is_internal_link(full_url):
self.graph.add_edge(url, full_url)
yield scrapy.Request(
full_url,
meta={"playwright": True},
callback=self.parse,
)
def is_internal_link(self, url):
return url.startswith("https://cats.com/") and not url.endswith((".pdf", ".jpg", ".png", ".css", ".js"))
def closed(self, reason):
"""Export the graph when the spider finishes."""
nx.write_gexf(self.graph, "test_graph.gexf")
self.logger.info("Graph exported to test_graph.gexf")
In my venv/bin directory, I can see python, python3, and python3.11.
The command scrapy list shows 'test_graph'.
Upvotes: 1
Views: 163
Reputation: 3561
DOWNLOADER_MIDDLEWARES = {
'scrapy_playwright.middleware.ScrapyPlaywrightDownloadHandler': 543,
}
ScrapyPlaywrightDownloadHandler
is not a downloader middleware, so it cannot be enabled as one; that is why Scrapy tries and fails to import a scrapy_playwright.middleware module that does not exist.
According to the documentation at https://github.com/scrapy-plugins/scrapy-playwright, the plugin is enabled by registering the download handlers and setting the Twisted reactor, not by adding a middleware entry.
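A minimal sketch of the corrected settings.py, assuming scrapy-playwright is installed in the same venv the spider runs from: drop the DOWNLOADER_MIDDLEWARES block entirely and keep only the download handlers and the asyncio reactor, which the question's settings already have.

```python
# settings.py -- scrapy-playwright hooks in as a download handler,
# so there is no DOWNLOADER_MIDDLEWARES entry for it at all.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# The handler requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

SPIDER_MODULES = ["test_crawler.spiders"]
NEWSPIDER_MODULE = "test_crawler.spiders"

PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}
```

With the middleware entry removed, requests that set meta={"playwright": True} are routed through Playwright by the download handler itself.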
Upvotes: 1