Reputation: 15147
There are a few concurrency settings in Scrapy, like CONCURRENT_REQUESTS. Does that mean the Scrapy crawler is multi-threaded? So if I run scrapy crawl my_crawler, will it literally fire multiple simultaneous requests in parallel?
I'm asking because I've read that Scrapy is single-threaded.
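For reference, here is what those settings look like in a project's settings.py (the values are just illustrative):
# settings.py -- illustrative values only
CONCURRENT_REQUESTS = 16              # how many requests Scrapy keeps in flight at once
CONCURRENT_REQUESTS_PER_DOMAIN = 8    # per-domain cap
DOWNLOAD_DELAY = 0.25                 # optional politeness delay between requests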
Upvotes: 21
Views: 20352
Reputation: 81
Scrapy is a single-threaded framework, but we can use multiple threads within a spider at the same time. Please read this article.
We can use subprocess to run spiders:
import subprocess
# Launch the "quotes" spider in a child process and export the scraped items to quotes_all.json
subprocess.run(["scrapy", "crawl", "quotes", "-o", "quotes_all.json"])
or
Use CrawlerProcess to run multiple spiders in the same process.
If you want to run multiple spiders per process or want to fetch and use the scraped items directly in your program, you would need to use the internal API of Scrapy.
# Run the spider with the internal API of Scrapy:
import multiprocessing
from functools import partial

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def crawler_func(spider, url):
    # Each worker process gets its own CrawlerProcess (and its own reactor)
    crawler_process = CrawlerProcess(get_project_settings())
    crawler_process.crawl(spider, url)
    crawler_process.start()

def start_spider(spider, urls):
    # One crawl per URL, spread over a pool of worker processes
    p = multiprocessing.Pool(100)
    return p.map(partial(crawler_func, spider), urls)
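Alternatively, if you just want several spiders sharing one process (the CrawlerProcess case mentioned above), a minimal sketch in the spirit of the Scrapy docs; the spider classes and module here are hypothetical:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# QuotesSpider and AuthorsSpider are hypothetical spiders from your own project
from myproject.spiders import QuotesSpider, AuthorsSpider

process = CrawlerProcess(get_project_settings())
process.crawl(QuotesSpider)   # schedule the first spider
process.crawl(AuthorsSpider)  # schedule the second spider
process.start()               # blocks here until both crawls have finished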
Upvotes: 3
Reputation: 3290
Scrapy is a single-threaded framework; we cannot use multiple threads within a spider at the same time. However, we can create multiple spiders and pipelines at the same time to make the process concurrent.
Scrapy does not support multi-threading because it is built on Twisted, which is an asynchronous, event-driven networking framework.
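To illustrate that model: a single spider can still have many requests in flight at once, because they are driven by Twisted's event loop on one thread rather than by threads. A minimal sketch (the spider name and URLs are hypothetical):
import scrapy

class ConcurrentExampleSpider(scrapy.Spider):
    # Hypothetical spider: everything here runs on a single thread,
    # yet up to CONCURRENT_REQUESTS downloads can be in progress at once.
    name = "concurrent_example"
    start_urls = [f"https://example.com/page/{i}" for i in range(50)]

    def parse(self, response):
        # Callbacks execute one at a time on the same thread,
        # while other responses continue downloading in the background.
        yield {"url": response.url, "status": response.status}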
Upvotes: 4
Reputation: 2187
Scrapy does most of its work synchronously. However, the handling of requests is done asynchronously.
I suggest this page if you haven't already seen it.
http://doc.scrapy.org/en/latest/topics/architecture.html
edit: I realize now the question was about threading and not necessarily whether it's asynchronous or not. That link would still be a good read though :)
Regarding your question about CONCURRENT_REQUESTS: this setting caps the number of requests that Twisted will have outstanding at once. Once that many requests have been started, Scrapy waits for some of them to finish before starting more.
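If you want to experiment with it, the setting can also be overridden per spider (or from the command line with -s CONCURRENT_REQUESTS=32). A minimal sketch with a hypothetical spider:
import scrapy

class MyCrawlerSpider(scrapy.Spider):
    # Hypothetical spider overriding the concurrency cap just for itself
    name = "my_crawler"
    custom_settings = {
        "CONCURRENT_REQUESTS": 32,  # allow up to 32 requests in flight
    }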
Upvotes: 8
Reputation: 14116
Scrapy is single-threaded, except for the interactive shell and some tests; see the source.
It's built on top of Twisted, which is single-threaded too and makes use of its own asynchronous concurrency capabilities, such as twisted.internet.interfaces.IReactorThreads.callFromThread; see the source.
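For context, callFromThread is Twisted's way of handing work from some other thread back to the single reactor thread. A minimal, Scrapy-independent sketch:
import threading
from twisted.internet import reactor

def runs_in_reactor_thread(msg):
    # Executed by the reactor on its single thread
    print(f"reactor thread got: {msg}")
    reactor.stop()

def worker():
    # The thread-safe way for a non-reactor thread to hand work to the reactor
    reactor.callFromThread(runs_in_reactor_thread, "hello from a worker thread")

# Start the worker thread once the reactor is up, then run the (single-threaded) loop
reactor.callWhenRunning(lambda: threading.Thread(target=worker).start())
reactor.run()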
Upvotes: 18