Reputation: 1317
I'm having a hard time figuring out how Scrapy works (or rather, how I need to work with it). This question is a bit broad; it's more about understanding.
I set up a CrawlSpider and gave it 6 start URLs.
Each of those start URLs has 24 items to scrape, so I expected roughly 144 rows to end up in my database, but I only have 18 so far.
So I'm using
def parse_start_url(self, response):
to avoid complications with Rules for now.
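For reference, here's a minimal sketch of my setup (the spider name, URLs, and CSS selectors below are placeholders, not my real ones):
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = "example"  # placeholder name
    start_urls = [
        "http://example.com/page1",  # placeholder; picture 6 such URLs
    ]

    # With no Rules defined, only the start URLs themselves are requested,
    # and parse_start_url is called once per start URL's response.
    def parse_start_url(self, response):
        # "div.item" / "a::text" stand in for the selectors of the 24 items
        for item in response.css("div.item"):
            yield {"title": item.css("a::text").get()}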
Now Scrapy should take those 6 URLs, crawl them, and then process the items on those pages.
But instead it seems to take those 6 URLs, check each link on those pages, and follow those links first - is that possible?
Does Scrapy just take URL 1, scan all its links, and follow everything that's allowed?
When does it get to URL 2?
Upvotes: 2
Views: 942
Reputation: 8614
You can find your answer in the official documentation, but for completeness I'll paste it here:
By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO order. This order is more convenient in most cases. If you do want to crawl in true BFO order, you can do it by setting the following settings:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
Note: the way you describe the crawl order is usually called DFS (depth-first search) or BFS (breadth-first search). Scrapy's documentation calls these DFO and BFO (the 'O' stands for 'order'), but the meaning is the same.
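If you only want this behavior for one spider rather than the whole project, a sketch using Scrapy's custom_settings class attribute (spider name is a placeholder):
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = "example"  # placeholder name
    # FIFO queues make the scheduler pop the oldest requests first,
    # which gives you BFO crawl order for this spider only.
    custom_settings = {
        "DEPTH_PRIORITY": 1,
        "SCHEDULER_DISK_QUEUE": "scrapy.squeues.PickleFifoDiskQueue",
        "SCHEDULER_MEMORY_QUEUE": "scrapy.squeues.FifoMemoryQueue",
    }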
Upvotes: 3