finite_diffidence

Reputation: 943

How to scrape a directory full of .html files using scrapy?

I have a folder full of .html files. Is there a way to scrape the data using scrapy?

My attempt:

import scrapy
import os

LOCAL_FOLDER = 'html_files/'
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

class MySpider(scrapy.Spider):
    name = 'mySpider'
    start_urls = [f"file://{BASE_DIR}/{LOCAL_FOLDER}"]

    def parse(self, response):
        rows = response.xpath('//div[@class="data"]//tbody/tr')
        print(rows)

structure:

html_files/
    ├── b.html
    ├── c.html
    ├── d.html
    ├── e.html
    └── f.html

Any guidance would be much appreciated.

Upvotes: 1

Views: 570

Answers (1)

SuperUser

Reputation: 4822

I have created 4 html files (1.html - 4.html) in the html_files directory and read them with this spider:

import scrapy
import os


class TestSpider(scrapy.Spider):
    name = 'tempspider'
    path = r'html_files'
    base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

    def start_requests(self):
        # Build a file:// URL for every file in the directory and request it,
        # so each local .html file goes through the normal Scrapy pipeline.
        for file in os.listdir(self.path):
            url = 'file:///' + os.path.join(self.base_dir, self.path, file)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Print the first text node just to show the file was parsed.
        print(response.xpath('//text()').get())

Output:

[scrapy.core.engine] DEBUG: Crawled (200) <GET file:///...........%5Chtml_files%5C1.html> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET file:///...........%5Chtml_files%5C2.html> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET file:///...........%5Chtml_files%5C3.html> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET file:///...........%5Chtml_files%5C4.html> (referer: None)
html 1
html 2
html 3
html 4
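
If you prefer not to assemble the file:// URLs by hand, pathlib can do the scheme and escaping for you. This is only a sketch of an alternative, not part of the answer above; the spider name (localfiles) and the assumption that html_files sits one level above the spider file are illustrative:

from pathlib import Path

import scrapy


class LocalFilesSpider(scrapy.Spider):
    # Hypothetical variant of the spider above; name and folder location are assumptions.
    name = 'localfiles'
    folder = Path(__file__).resolve().parent.parent / 'html_files'

    def start_requests(self):
        # Path.as_uri() yields a properly escaped file:// URL on any platform.
        for html_file in sorted(self.folder.glob('*.html')):
            yield scrapy.Request(url=html_file.as_uri(), callback=self.parse)

    def parse(self, response):
        # Same table extraction the question attempted, emitted as an item.
        rows = response.xpath('//div[@class="data"]//tbody/tr')
        yield {'file': response.url, 'rows': len(rows)}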

Upvotes: 2
