Reputation: 2476

Scraping an HTML file saved on the local system

I'm trying to scrape an HTML file saved on my local file system (Windows 10).

When I give the file path in this format:

start_urls = ['file:///path/to/file/file_name.htm']

I get this error:

[scrapy.core.scraper] ERROR: Error downloading <GET file:///path/to/file/file_name.htm>
FileNotFoundError: [Errno 2] No such file or directory: '\path\to\file\file_name.htm'

When I give the file path in this format:

start_urls = ['path/to/file/file_name.htm']

I get this error:

[scrapy.core.engine] ERROR: Error while obtaining start requests
raise ValueError('Missing scheme in request url: %s' % self._url)

How can I read a local HTML file and scrape it on Windows?

Upvotes: 0

Answers (2)

Ruman_bhuiyan

Reputation: 109

You can scrape a file saved on your local system like this:

from bs4 import BeautifulSoup

# The html5lib package must be installed; BeautifulSoup loads the parser by name.
with open('C:/Users/CSE/AppData/Local/atom/app-1.42.0/practise.html', 'r') as my_file:
    soup = BeautifulSoup(my_file, 'html5lib')
print(soup.prettify())

The first argument to open() is the file path (I have used my own path here), and the second is the mode in which you want to open the file.
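The two arguments described above can be sketched with the standard library alone. This is only an illustration: the helper name read_html, the throwaway sample file, and the utf-8 encoding are assumptions, not part of the original answer.

```python
import tempfile
from pathlib import Path


def read_html(file_path):
    """Return the text of a local HTML file (hypothetical helper)."""
    # "r" opens the file for reading in text mode.
    # utf-8 is an assumption; use whatever encoding the file was saved with.
    with open(file_path, "r", encoding="utf-8") as html_file:
        return html_file.read()


# Demo with a throwaway file standing in for the real HTML file.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "practise.html"
    sample_path.write_text(
        "<html><head><title>Example title</title></head></html>",
        encoding="utf-8",
    )
    markup = read_html(sample_path)

print(markup)
```

The returned string can then be passed to BeautifulSoup in place of the open file object.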

Upvotes: 1

vezunchik

Reputation: 3717

I think it is wrong to use start_urls in this case. Maybe you could read the data from the file and then apply a Selector to it? See this example:

>>> from scrapy import Selector
>>> with open('example.html') as f:
...     sel = Selector(text=f.read())
...
>>> sel.css('head title::text').get()
'Example title'

If you need to, you can put the file-reading block inside the start_requests method.
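If you would rather keep start_urls, one possible cause of the FileNotFoundError in the question is a file URL that lacks the drive letter; that is a guess, since the real path is not shown. A minimal sketch, assuming the file exists locally, builds the URL with pathlib instead of typing it by hand. The helper name local_file_url and the temporary sample file are hypothetical.

```python
import tempfile
from pathlib import Path


def local_file_url(path):
    """Return an absolute file:// URL for a local path (hypothetical helper)."""
    # resolve() makes the path absolute; as_uri() raises ValueError for relative paths.
    # On Windows the result includes the drive letter, e.g. file:///C:/...
    return Path(path).resolve().as_uri()


# Demo with a throwaway file standing in for the real HTML file.
with tempfile.TemporaryDirectory() as tmp_dir:
    html_path = Path(tmp_dir) / "file_name.htm"
    html_path.write_text(
        "<html><head><title>Example title</title></head></html>",
        encoding="utf-8",
    )
    start_urls = [local_file_url(html_path)]

print(start_urls)
```

In a spider, the same call on the real path would supply the single entry of start_urls.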

Upvotes: 3
