Anonymous

Reputation: 37

Python Scrapy: How do you run your spider from a separate file?

So I've created a spider in Scrapy that now successfully targets all the text I want.

How exactly do you execute this spider from another Python file? I want to be able to pass it new URLs, store the data it finds in a dictionary, and then load that into a DataFrame.

At the moment I can only get it to run with the terminal command 'scrapy crawl SpiderName'.

from scrapy.spiders import Spider
from scrapy_splash import SplashRequest


class SpiderName(Spider):
    name = 'SpiderName'
    Page = 'https://www.urlname.com'

    def start_requests(self):
        # Render the page through Splash so JavaScript content is available
        yield SplashRequest(url=self.Page, callback=self.parse,
                            endpoint='render.html',
                            args={'wait': 0.5},
                            )

    def parse(self, response):
        # Yield one item per matching row
        for x in response.css("div.row.list"):
            yield {
                'Entry': x.css("span[data-bind]::text").getall()
            }

Thanks

Upvotes: 1

Views: 1383

Answers (1)

furas

Reputation: 142681

In the Scrapy docs, under Common Practices, you can see Run Scrapy from a script:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # ... Your spider definition ...

# ... run it ...

process = CrawlerProcess(settings={ ... })    
process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
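If your script lives inside an existing Scrapy project, one way to fill in the settings placeholder is Scrapy's get_project_settings helper, which reads the project's settings.py. A minimal sketch:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Uses the settings.py of the surrounding Scrapy project,
# so middlewares such as scrapy-splash keep working
process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)
process.start()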

If you add your own __init__

class MySpider(scrapy.Spider):

    def __init__(self, urls, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # Requests will be generated from whatever list is passed in
        self.start_urls = urls

then you can run it with urls as a parameter:

process.crawl(MySpider, urls=['http://books.toscrape.com/', 'http://quotes.toscrape.com/'])
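To get the scraped items back into Python (the dictionary/DataFrame part of the question), one option is to connect a handler to Scrapy's item_scraped signal before starting the crawl. A minimal sketch, where the results list, the collect_item handler, and the pandas import are illustrative assumptions, not part of the original code:

import pandas as pd  # assumed available; only needed for the final DataFrame step

from scrapy import signals
from scrapy.crawler import CrawlerProcess

results = []  # every item the spider yields ends up here

def collect_item(item, response, spider):
    # Called once per yielded item via the item_scraped signal
    results.append(dict(item))

process = CrawlerProcess()  # or CrawlerProcess(get_project_settings())
crawler = process.create_crawler(MySpider)
crawler.signals.connect(collect_item, signal=signals.item_scraped)
process.crawl(crawler, urls=['http://books.toscrape.com/'])
process.start()  # blocks until the crawl is finished

df = pd.DataFrame(results)

Note that process.start() can only be called once per Python process, because the underlying Twisted reactor cannot be restarted.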

Upvotes: 1
