Reputation: 348
I'm looking for a way to use Scrapy with HTML pages that I have saved on my computer. So far, I get this error:
requests.exceptions.InvalidSchema: No connection adapters were found for 'file:///home/stage/Guillaume/scraper_test/mypage/details.html'
SPIDER_START_URLS = ["file:///home/stage/Guillaume/scraper_test/mypage/details.html"]
Upvotes: 2
Views: 405
Reputation: 33203
I have had great success using request_fingerprint to inject existing HTML files into HTTPCACHE_DIR (which is almost always .scrapy/httpcache/${spider_name}), then turning on the aforementioned HTTP cache middleware, which defaults to file-based cache storage, and the "Dummy Policy", which considers the on-disk file authoritative and won't make a network request if it finds the URL in the cache.
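Concretely, that setup corresponds to settings along these lines (a minimal sketch; DummyPolicy and FilesystemCacheStorage are in fact Scrapy's defaults, so only enabling the cache is strictly required):

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
HTTPCACHE_DIR = 'httpcache'  # resolved relative to the project's .scrapy/ directory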
I would expect the script to look something like this (this is just the general idea, and not guaranteed to even run):
import sys
from scrapy.extensions.httpcache import FilesystemCacheStorage
from scrapy.http import Request, HtmlResponse
from scrapy.settings import Settings
from scrapy.spiders import Spider
# this value is the actual URL from which the on-disk file was saved,
# not the "file://" version
url = sys.argv[1]
html_filename = sys.argv[2]
# read as bytes: the Response body must be bytes, not str
with open(html_filename, 'rb') as fh:
    html_bytes = fh.read()
req = Request(url=url)
resp = HtmlResponse(url=req.url, body=html_bytes, encoding='utf-8', request=req)
settings = Settings()
cache = FilesystemCacheStorage(settings)
# the on-disk cache path is derived from the spider's name,
# so this must match the spider that will later read the cache
spider = Spider(name='my_spider')  # fill in your actual Spider here
cache.open_spider(spider)
cache.store_response(spider, req, resp)
Upvotes: 1