Gill Bates

Reputation: 15147

Is Scrapy single-threaded or multi-threaded?

There are a few concurrency settings in Scrapy, such as CONCURRENT_REQUESTS. Does that mean the Scrapy crawler is multi-threaded? So if I run scrapy crawl my_crawler, will it literally fire multiple simultaneous requests in parallel? I'm asking because I've read that Scrapy is single-threaded.

Upvotes: 21

Views: 20352

Answers (4)

Mohsin Raza

Reputation: 81

Scrapy is a single-threaded framework, but we can still run multiple spiders at the same time.

Please read this article:

https://levelup.gitconnected.com/how-to-run-scrapy-spiders-in-your-program-7db56792c1f7#:~:text=We%20use%20the%20CrawlerProcess%20class,custom%20settings%20for%20the%20Spider

We can use subprocess to run spiders.

    # Run a spider in its own process and export the scraped items:
    import subprocess
    subprocess.run(["scrapy", "crawl", "quotes", "-o", "quotes_all.json"])

or

Use CrawlerProcess to run multiple spiders in the same process.

If you want to run multiple spiders per process or want to fetch and use the scraped items directly in your program, you would need to use the internal API of Scrapy.

    # Run the spider with the internal API of Scrapy:
    import multiprocessing
    from functools import partial

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def crawler_func(spider, url):
        # Each call runs in its own process, so each one gets its own reactor.
        crawler_process = CrawlerProcess(get_project_settings())
        crawler_process.crawl(spider, url)
        crawler_process.start()

    def start_spider(spider, urls):
        # One process per URL, up to 100 at a time.
        pool = multiprocessing.Pool(100)
        return pool.map(partial(crawler_func, spider), urls)

Upvotes: 3

Aman Garg

Reputation: 3290

Scrapy is a single-threaded framework; we cannot use multiple threads within a spider at the same time. However, we can create multiple spiders and pipelines at the same time to make the process concurrent. Scrapy does not support multi-threading because it is built on Twisted, which is an asynchronous networking framework.
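To see how "single-threaded" and "concurrent" fit together, here is a stdlib asyncio sketch (an analogy, not Scrapy's actual Twisted internals): two toy spiders with non-blocking waits are interleaved by one event loop, and every record lands on the same thread.

```python
import asyncio
import threading

results = []  # (spider, page, thread id) records, to show the interleaving

async def fake_spider(name, pages):
    for page in range(pages):
        await asyncio.sleep(0.01)  # stands in for a non-blocking HTTP request
        results.append((name, page, threading.get_ident()))

async def main():
    # Both spiders run concurrently, yet everything stays on one thread.
    await asyncio.gather(fake_spider("quotes", 2), fake_spider("books", 2))

asyncio.run(main())
print(results)
```

The concurrency comes from the event loop switching between coroutines at each await point, not from extra threads.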

Upvotes: 4

rocktheartsm4l

Reputation: 2187

Scrapy does most of its work synchronously. However, the handling of requests is done asynchronously.

I suggest this page if you haven't already seen it.

http://doc.scrapy.org/en/latest/topics/architecture.html

Edit: I realize now the question was about threading and not necessarily whether it's asynchronous or not. That link would still be a good read though :)

Regarding your question about CONCURRENT_REQUESTS: this setting changes the number of requests that Twisted will keep pending at once. Once that many requests have been started, it will wait for some of them to finish before starting more.
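That cap behaves like a semaphore rather than a thread pool. A stdlib sketch of the same idea (asyncio here, not Twisted, and CONCURRENT_REQUESTS is just an illustrative constant, not Scrapy's real implementation):

```python
import asyncio

CONCURRENT_REQUESTS = 2  # a cap on in-flight requests, not a thread count

in_flight = 0
max_seen = 0

async def fetch(i, sem):
    global in_flight, max_seen
    async with sem:                # new requests wait here once the cap is hit
        in_flight += 1
        max_seen = max(max_seen, in_flight)
        await asyncio.sleep(0.01)  # stands in for an HTTP request
        in_flight -= 1

async def main():
    sem = asyncio.Semaphore(CONCURRENT_REQUESTS)
    await asyncio.gather(*(fetch(i, sem) for i in range(6)))

asyncio.run(main())
print(max_seen)  # never exceeds CONCURRENT_REQUESTS
```

All six "requests" complete on one thread, but no more than two are ever in flight at the same moment.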

Upvotes: 8

famousgarkin

Reputation: 14116

Scrapy is single-threaded, except for the interactive shell and some tests; see the source.

It's built on top of Twisted, which is single-threaded too, and makes use of its own asynchronous concurrency capabilities, such as twisted.internet.interfaces.IReactorThreads.callFromThread; see the source.
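callFromThread's job is to hand work from a worker thread back to the single reactor thread. A sketch of the analogous stdlib mechanism (asyncio's call_soon_threadsafe, not Twisted itself):

```python
import asyncio
import threading

handled_on = []  # thread ids, to verify where the callback ran

async def main():
    loop = asyncio.get_running_loop()
    done = asyncio.Event()

    def on_result():
        # Runs on the event-loop thread, not the worker thread.
        handled_on.append(threading.get_ident())
        done.set()

    def worker():
        # A real worker might do blocking I/O here, then report back
        # to the loop; this is the call_soon_threadsafe handoff.
        loop.call_soon_threadsafe(on_result)

    threading.Thread(target=worker).start()
    await done.wait()
    handled_on.append(threading.get_ident())  # loop thread id, for comparison

asyncio.run(main())
```

The design point is the same as Twisted's: the event loop itself is never touched from another thread; other threads may only schedule callbacks onto it.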

Upvotes: 18
