Reputation: 2476
I'm trying to scrape an HTML file saved on my local file system (Windows 10).
When I give the file path in the format
start_urls = ['file:///path/to/file/file_name.htm']
I get the error
[scrapy.core.scraper] ERROR: Error downloading <GET file:///path/to/file/file_name.htm>
FileNotFoundError: [Errno 2] No such file or directory: '\path\to\file\file_name.htm'
When I give the file path in the format
start_urls = ['path/to/file/file_name.htm']
I get the error
[scrapy.core.engine] ERROR: Error while obtaining start requests
raise ValueError('Missing scheme in request url: %s' % self._url)
How can I read the local HTML file and scrape it on Windows?
Upvotes: 0
Views: 5221
Reputation: 109
You can scrape a file saved on your local system like this:
from bs4 import BeautifulSoup

# html5lib must be installed for this parser name to work, but it does not need to be imported
with open('C:/Users/CSE/AppData/Local/atom/app-1.42.0/practise.html', 'r') as my_file:
    soup = BeautifulSoup(my_file, 'html5lib')
print(soup.prettify())
The first argument to open() is the path to your file (I have used my own path here), and the second is the mode in which you want to open it.
Upvotes: 1
Reputation: 3717
I think it is wrong to use start_urls in this case. Maybe you can try reading the data from the file and then applying Selector to it?
Check this example:
>>> from scrapy import Selector
>>> f = open('example.html')
>>> sel = Selector(text=f.read())
>>> sel.css('head title::text').get()
'Example title'
If you need to, you can put the file-reading block inside the start_requests function.
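If the file should still go through Scrapy's downloader from start_requests, the request needs an absolute file URL. On Windows that URL includes the drive letter, which the path in the question's FileNotFoundError appears to lack. Here is a minimal sketch using only the standard library. The helper name local_file_url is hypothetical, and file_name.htm stands in for your real path.

```python
from pathlib import Path


def local_file_url(path):
    """Return a file:// URL for a local path (hypothetical helper)."""
    # absolute() prepends the current working directory to a relative path;
    # on Windows the result includes the drive letter, e.g. file:///C:/...
    return Path(path).absolute().as_uri()


# Inside a spider, start_requests could then yield something like
# scrapy.Request(local_file_url("file_name.htm"), callback=self.parse)
print(local_file_url("file_name.htm"))
```

Path.as_uri() raises ValueError for a relative path, which is why the path is made absolute first.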
Upvotes: 3