How to randomise a broad crawl with Scrapy when results appear ordered a-z

Question

I'm trying to perform a very broad crawl with Scrapy. I've followed the basic broad-crawl instructions from the docs, but I'm wondering why the results seem to be processed in alphanumerical order.

My queue grows to thousands of items, yet the URLs in the output appear to be sorted and scraped in roughly alphanumerical order, e.g.

asomething.com
1-afoobar.com
001-bar.com
aafoo.com
betabar.com

The scraped results aren't exactly in alphabetical order, but after hundreds of new items they all seemingly start with a number or are very early in the alphabet, indicating some kind of sorting.

It seems to me that while populating the queue there is some kind of sorting going on. This seems to contradict the idea of a broad crawl. Anybody got pointers to why these are queued and scraped like this, and how to "randomize" the queue better?

The spider code extracting links and adding them to the queue:

links = self.linkExtr.extract_links(response)
for link in links:
    yield response.follow(link, callback=self.parse, meta={"page": link.url})
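
One experiment that might help narrow this down, sketched under assumptions: shuffle the extracted links and give each request a random integer priority before yielding it. Scrapy's `Request` (and `response.follow`) accepts a `priority` argument, and requests with a higher value are scheduled earlier. The helper name `with_random_priority` and the `low`/`high` bounds are hypothetical, chosen only for this illustration. This does not explain where the apparent sorting comes from; it only perturbs the order in which requests enter the queue.

```python
import random


def with_random_priority(links, rng=random, low=-100, high=100):
    """Return (link, priority) pairs in shuffled order.

    Hypothetical helper for illustration; `low` and `high` are
    arbitrary bounds, not values recommended by Scrapy.
    """
    shuffled = list(links)   # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)    # randomise the order the links are yielded in
    # Pair every link with a random integer priority in [low, high].
    return [(link, rng.randint(low, high)) for link in shuffled]


# Inside the spider's parse method this might be used as follows
# (assumption: self.linkExtr is the link extractor from the question):
#
#     links = self.linkExtr.extract_links(response)
#     for link, prio in with_random_priority(links):
#         yield response.follow(link, callback=self.parse,
#                               meta={"page": link.url}, priority=prio)
```

Passing a seeded `random.Random` instance as `rng` makes the ordering reproducible while debugging.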

Answers (0)