Reputation: 805
I am trying to build a crawler using Scrapy. In every tutorial in Scrapy's official documentation or in blog posts, I see people writing a class in a .py file and executing it through the scrapy command-line tool.
On their main page, the following example is given:
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}
        for next_page in response.css('div.prev-post > a'):
            yield response.follow(next_page, self.parse)
and the code is then run with
scrapy runspider myspider.py
I am unable to find a way to write the same code so that it can be executed with something like
python myspider.py
I also looked at the Requests and Responses section of their website to understand how requests and responses are handled within the shell, but running that code as a plain Python script
( >> python myspider.py
)
did not show anything. Any guidance on how to transform the code so that it runs outside the scrapy shell, or pointers to documents that elaborate on this, would be appreciated.
EDIT: Downvoters, please do not take undue advantage of your anonymity. If you have a valid reason to downvote, please state it in a comment after you downvote.
Upvotes: 3
Views: 1581
Reputation: 2545
You can use a CrawlerProcess to run your spider from a plain Python main script, and then run it with python myspider.py
For example:
import scrapy
from scrapy.crawler import CrawlerProcess

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}
        for next_page in response.css('div.prev-post > a'):
            yield response.follow(next_page, self.parse)

if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(BlogSpider)
    process.start()  # the script blocks here until the crawl finishes
Useful link: https://doc.scrapy.org/en/latest/topics/practices.html
Upvotes: 6