Baka

Reputation: 759

How to crawl local HTML file with Scrapy

I tried to crawl a local HTML file stored on my desktop with the code below, but I run into errors such as "No such file or directory: '/robots.txt'" before the crawl even starts.

[Scrapy command]

$ scrapy crawl test -o test01.csv

[Scrapy spider]

import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = []
    start_urls = ['file:///Users/Name/Desktop/test/test.html']

[Errors]

2018-11-16 01:57:52 [scrapy.core.engine] INFO: Spider opened
2018-11-16 01:57:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-16 01:57:52 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-11-16 01:57:52 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET file:///robots.txt> (failed 1 times): [Errno 2] No such file or directory: '/robots.txt'
2018-11-16 01:57:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET file:///robots.txt> (failed 2 times): [Errno 2] No such file or directory: '/robots.txt'

Upvotes: 4

Views: 3493

Answers (2)

Vitor Hugo

Reputation: 26

To solve the "No such file or directory: '/robots.txt'" error, you can go to your project's settings.py file and comment out the line:

#ROBOTSTXT_OBEY = True
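
Setting the flag to False explicitly has the same effect. A minimal sketch of the relevant lines in settings.py (everything else in the file stays as generated):

# settings.py
# With ROBOTSTXT_OBEY enabled, Scrapy requests /robots.txt before each
# crawl; a file:// URL has no robots.txt to fetch, hence the error above.
ROBOTSTXT_OBEY = False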

Upvotes: 0

Japes

Reputation: 56

When working locally, I never specify allowed_domains. Try taking that line out and see if it works.

In your error, it's testing the 'empty' domain that you have given it.
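
For reference, a minimal sketch of the spider with allowed_domains removed; the parse method and its XPath are illustrative additions, not from the question:

import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    # No allowed_domains: a file:// URL has no domain to filter on
    start_urls = ['file:///Users/Name/Desktop/test/test.html']

    def parse(self, response):
        # Placeholder field so `-o test01.csv` has something to write
        yield {'title': response.xpath('//title/text()').get()}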

Upvotes: 2
