Marco Dinatsoli

Reputation: 10570

scrapy run spider from script

I want to run my spider from a script rather than via the scrapy crawl command.

I found this page

http://doc.scrapy.org/en/latest/topics/practices.html

but it doesn't say where to put that script.

Any help, please?

Upvotes: 26

Views: 35113

Answers (5)

Sun Bee

Reputation: 1820

Building on the response by @AlmogCohen below:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from spider_project.spiders.spider import MySpider

def run_spider():
    process = CrawlerProcess(get_project_settings())

    # Run the spider programmatically
    process.crawl(MySpider)
    process.start()

if __name__ == "__main__":
    run_spider()  # Example usage

Here "spider_project" is your folder that contains the folder "spiders" with your spiders in it. "MySpider" is the class that is child of scrapy.Spider.

If your spider takes command-line arguments, you can declare them as parameters of run_spider and pass them along when you call it.
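For instance, a minimal sketch of that (the category argument is just an illustrative name; keyword arguments given to process.crawl() are forwarded to the spider's __init__, exactly like -a category=... on the command line):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from spider_project.spiders.spider import MySpider

def run_spider(category):
    process = CrawlerProcess(get_project_settings())
    # Forwarded to MySpider.__init__, like "-a category=..." on the CLI
    process.crawl(MySpider, category=category)
    process.start()

if __name__ == "__main__":
    run_spider("books")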

This is very useful, for example, when calling the spider from a REST API endpoint in FastAPI.
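A rough sketch of such an endpoint, reusing the run_spider(category) variant above (the /crawl route and response body are illustrative assumptions):

import multiprocessing

from fastapi import FastAPI

app = FastAPI()

@app.post("/crawl")
def crawl(category: str):
    # Run the spider in a fresh process: Twisted's reactor cannot be
    # restarted once it has run, so a long-lived API process must not
    # call process.start() directly on every request.
    p = multiprocessing.Process(target=run_spider, args=(category,))
    p.start()
    p.join()  # block until the crawl finishes
    return {"status": "done", "category": category}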

Upvotes: 0

Almog Cohen

Reputation: 1313

It is simple and straightforward :)

Just check the official documentation. I would make one small change there, so that the spider runs only when you execute python myscript.py and not every time you import from the file: just add an if __name__ == "__main__": guard.

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    pass

if __name__ == "__main__":
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(MySpider)
    process.start() # the script will block here until the crawling is finished

Now save the file as myscript.py and run python myscript.py.
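To make that concrete, here is a self-contained sketch of such a myscript.py; the quotes.toscrape.com target and the parse logic are illustrative assumptions, not part of the original answer:

import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        # Yield one item per quote on the page
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(QuotesSpider)
    process.start()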

Enjoy!

Upvotes: 44

Aminah Nuraini

Reputation: 19146

Why don't you just do this?

from scrapy import cmdline

cmdline.execute("scrapy crawl myspider".split())

Put that script in the same directory as your scrapy.cfg.
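The string can include any flags you would normally type at the shell; for example (the -a argument name and output file are illustrative):

from scrapy import cmdline

# Equivalent to typing the command at the shell. Note that
# cmdline.execute() does not return: it exits the process once
# the command finishes.
cmdline.execute("scrapy crawl myspider -a category=books -o items.json".split())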

Upvotes: 6

Elias Dorneles

Reputation: 23796

You can just create a normal Python script and then use Scrapy's runspider command, which lets you run a spider without having to create a project.

For example, you can create a single file stackoverflow_spider.py with something like this:

import scrapy
from scrapy.loader import ItemLoader

class QuestionItem(scrapy.Item):
    idx = scrapy.Field()
    title = scrapy.Field()

class StackoverflowSpider(scrapy.Spider):
    name = 'SO'
    start_urls = ['http://stackoverflow.com']

    def parse(self, response):
        questions = response.css('#question-mini-list .question-summary')
        for i, elem in enumerate(questions):
            loader = ItemLoader(item=QuestionItem(), selector=elem)
            loader.add_value('idx', i)
            loader.add_xpath('title', './/h3/a/text()')
            yield loader.load_item()

Then, provided you have Scrapy properly installed, you can run it using:

scrapy runspider stackoverflow_spider.py -o questions-items.json

Upvotes: 3

Guy Gavriely

Reputation: 11396

Luckily, the Scrapy source is open, so you can follow the way the crawl command works and do the same in your code:

# excerpt from Scrapy's crawl command (an older API version);
# spname and opts.spargs come from the command's argument parser
...
crawler = self.crawler_process.create_crawler()
spider = crawler.spiders.create(spname, **opts.spargs)
crawler.crawl(spider)
self.crawler_process.start()
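For comparison, a rough equivalent against the current API, assuming the script is run from inside a Scrapy project so the spider can be looked up by name:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl("myspider")  # spiders can be referenced by name, as "scrapy crawl" does
process.start()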

Upvotes: 6
