Reputation: 10570
I want to run my spider from a script rather than via `scrapy crawl`.
I found this page
http://doc.scrapy.org/en/latest/topics/practices.html
but it doesn't actually say where to put that script.
Any help, please?
Upvotes: 26
Views: 35113
Reputation: 1820
Building on the response by @AlmogCohen above:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from spider_project.spiders.spider import MySpider


def run_spider():
    process = CrawlerProcess(get_project_settings())
    # Run the spider programmatically
    process.crawl(MySpider)
    process.start()


if __name__ == "__main__":
    run_spider()  # Example usage
Here "spider_project" is your folder that contains the folder "spiders" with your spiders in it. "MySpider" is the class that is child of scrapy.Spider
.
If your spider takes command-line args, you can declare them in the function definition of run_spider
and pass them during function call.
This is very useful, for example, when calling the spider from Rest API endpoint in FastAPI.
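For example, here is a minimal sketch of passing an argument through (the category parameter and its value are just illustrative assumptions, not part of the original answer). Keyword arguments given to process.crawl are forwarded to the spider's __init__, the same way -a category=... would be from the command line:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from spider_project.spiders.spider import MySpider


def run_spider(category):
    process = CrawlerProcess(get_project_settings())
    # Keyword arguments are forwarded to MySpider.__init__,
    # just as "scrapy crawl myspider -a category=..." would do.
    process.crawl(MySpider, category=category)
    process.start()


run_spider("electronics")  # hypothetical argument value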
Upvotes: 0
Reputation: 1313
It is simple and straightforward :)
Just check the official documentation. I would make a little change there, so that the spider runs only when you do python myscript.py and not every time you import from it. Just add an if __name__ == "__main__" guard:
import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = "myspider"  # every spider needs a name
    # ... your spider definition (start_urls, parse, etc.) ...


if __name__ == "__main__":
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished
Now save the file as myscript.py and run `python myscript.py`.
Enjoy!
Upvotes: 44
Reputation: 19146
Why don't you just do this?
from scrapy import cmdline
cmdline.execute("scrapy crawl myspider".split())
Put that script in the same directory as scrapy.cfg.
Upvotes: 6
Reputation: 23796
You can just create a normal Python script and then use Scrapy's command-line option runspider, which allows you to run a spider without having to create a project.
For example, you can create a single file stackoverflow_spider.py with something like this:
import scrapy
from scrapy.loader import ItemLoader


class QuestionItem(scrapy.Item):
    idx = scrapy.Field()
    title = scrapy.Field()


class StackoverflowSpider(scrapy.Spider):
    name = 'SO'
    start_urls = ['http://stackoverflow.com']

    def parse(self, response):
        questions = response.css('#question-mini-list .question-summary')
        for i, elem in enumerate(questions):
            loader = ItemLoader(QuestionItem(), elem)
            loader.add_value('idx', i)
            loader.add_xpath('title', ".//h3/a/text()")
            yield loader.load_item()
Then, provided you have Scrapy properly installed, you can run it using:
scrapy runspider stackoverflow_spider.py -t json -o questions-items.json
Upvotes: 3
Reputation: 11396
Luckily, the Scrapy source is open, so you can follow the way the crawl command works and do the same in your code:
...
crawler = self.crawler_process.create_crawler()
spider = crawler.spiders.create(spname, **opts.spargs)
crawler.crawl(spider)
self.crawler_process.start()
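In current Scrapy versions that internal API has changed; the documented way to do the equivalent from your own code is CrawlerProcess (as in the other answers) or CrawlerRunner. A minimal sketch with CrawlerRunner, assuming a spider class MySpider you can import (the import path here is hypothetical):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

from myproject.spiders.myspider import MySpider  # hypothetical import path

configure_logging()
runner = CrawlerRunner()
d = runner.crawl(MySpider)           # crawl() returns a Deferred
d.addBoth(lambda _: reactor.stop())  # stop the reactor once the crawl ends
reactor.run()                        # blocks here until the crawl finishes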
Upvotes: 6