Reputation: 943
I have a folder full of .html files. Is there a way to scrape data from them using Scrapy?
My attempt:
import scrapy
import os

LOCAL_FOLDER = 'html_files/'
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))


class MySpider(scrapy.Spider):
    name = 'mySpider'
    start_urls = [f"file://{BASE_DIR}/{LOCAL_FOLDER}"]

    def parse(self, response):
        rows = response.xpath('//div[@class="data"]//tbody/tr')
        print(rows)
Folder structure:
html_files/
├── b.html
├── c.html
├── d.html
├── e.html
└── f.html
Any guidance would be much appreciated.
Upvotes: 1
Views: 570
Reputation: 4822
I created four HTML files (1.html to 4.html) in the html_files directory and requested each one through a file:// URL:
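import scrapy
import os


class TestSpider(scrapy.Spider):
    name = 'tempspider'
    path = r'html_files'  # folder that holds the local HTML files
    # Directory one level above the folder containing this spider file
    base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

    def start_requests(self):
        # os.listdir resolves self.path against the current working directory,
        # so run the spider from base_dir
        for file in os.listdir(self.path):
            url = 'file:///' + os.path.join(self.base_dir, self.path, file)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Print the first text node of each page
        print(response.xpath('//text()').get())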
Output:
[scrapy.core.engine] DEBUG: Crawled (200) <GET file:///...........%5Chtml_files%5C1.html> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET file:///...........%5Chtml_files%5C2.html> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET file:///...........%5Chtml_files%5C3.html> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET file:///...........%5Chtml_files%5C4.html> (referer: None)
html 1
html 2
html 3
html 4
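The %5C in the log above is a percent-encoded backslash, which comes from joining the path by hand on Windows. A minimal sketch of one alternative, assuming Python 3 and the standard-library pathlib, builds the file URLs with Path.as_uri() instead. The helper name local_html_urls is hypothetical and not part of Scrapy:

```python
from pathlib import Path


def local_html_urls(folder):
    """Return sorted file:// URLs for the .html files directly inside folder."""
    # as_uri() only works on absolute paths, so resolve the folder first
    root = Path(folder).resolve()
    # glob("*.html") skips other file types; is_file() skips directories
    return [p.as_uri() for p in sorted(root.glob("*.html")) if p.is_file()]
```

Each returned URL could then be passed to scrapy.Request(url=url, callback=self.parse) inside start_requests, in place of the string concatenation. This also skips anything in the folder that is not an .html file.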
Upvotes: 2