harshvardhan

Reputation: 805

Can we run scrapy code outside of scrapy shell?

I am trying to build a crawler using Scrapy. In every tutorial in Scrapy's official documentation or in blog posts, I see people defining a class in a .py file and executing it through the scrapy shell.

On their main page, the following example is given

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}

        for next_page in response.css('div.prev-post > a'):
            yield response.follow(next_page, self.parse)

and then the code is run using

scrapy runspider myspider.py

I am unable to find a way to write the same code in a manner that can be executed with something like

python myspider.py

I also looked at the Requests and Responses section of their documentation to understand how requests and responses are handled within the shell, but running that code directly with Python

( >> python myspider.py )

did not show anything. Any guidance on how to transform the code so that it runs outside the scrapy shell, or pointers to any documents that elaborate on this, would be appreciated.

EDIT: Downvoters, please do not take undue advantage of your anonymity. If you have a valid reason to downvote, please explain it in a comment after you downvote.

Upvotes: 3

Views: 1581

Answers (1)

Ami Hollander

Reputation: 2545

You can use CrawlerProcess to run your spider from a Python main script, and then launch it with python myspider.py.

For example:

import scrapy
from scrapy.crawler import CrawlerProcess


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        # Extract the title of each post on the page
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}

        # Follow the link to the next (older) post
        for next_page in response.css('div.prev-post > a'):
            yield response.follow(next_page, self.parse)


if __name__ == '__main__':
    # CrawlerProcess starts a Twisted reactor and runs the spider in-process,
    # so the script works with a plain `python myspider.py`
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(BlogSpider)
    process.start()  # blocks until the crawl is finished

Useful link: https://doc.scrapy.org/en/latest/topics/practices.html
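That practices page also documents CrawlerRunner for cases where you want more control over the Twisted reactor (for example, running the crawl alongside other code). A minimal sketch, assuming the same BlogSpider defined above, might look like:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

# Assumes BlogSpider is defined as in the snippet above
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(BlogSpider)
d.addBoth(lambda _: reactor.stop())  # stop the reactor when the crawl finishes
reactor.run()  # the script blocks here until the crawling is finished

With CrawlerRunner you have to start and stop the reactor yourself, whereas CrawlerProcess handles that for you, which is why the simpler CrawlerProcess version above is usually enough for a standalone script.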

Upvotes: 6
