detweiller

Reputation: 157

Scrapy CrawlSpider - Start crawl on next URL only after first URL is complete

I have a spider that looks something like this:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# CrawlsItem and get_domain are defined elsewhere in my project

class SomeSpider(CrawlSpider):
    name = 'grablink'
    allowed_domains = ['www.yellowpages.com', 'sports.yahoo.com']
    start_urls = ['http://www.yellowpages.com/', 'http://sports.yahoo.com']
    rules = (Rule(LinkExtractor(allow=allowed_domains), callback='parse_obj', follow=False),)

    def parse_obj(self, response):
        # collect every link that points outside the allowed domains
        for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item = CrawlsItem()
            item['DomainName'] = get_domain(response.url)
            item['LinkToOtherDomain'] = link.url
            item['LinkFoundOn'] = response.url
            yield item

It extracts internal links from the given start URLs, follows them, and extracts all the external links from the linked pages. It works, but right now the results are not in any particular order: some rows are from yellowpages.com while others are from sports.yahoo.com. I understand this is normal Scrapy behavior because it is asynchronous, but is there any way I can make it extract all the links from yellowpages.com first, then sports.yahoo.com, and so on? Within a particular URL the crawl can be asynchronous, that doesn't matter, but the start URLs themselves should be crawled in order.

I think one way to do this would be to keep all the start URLs in a separate list, put only one URL in start_urls, and then, once the crawl for that URL has finished, start the crawl on the next one. But I don't know where I would do this. How can I tell that the crawl for one URL has completed so that I can begin the next one?
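For what it's worth, Scrapy does have a spider_closed signal that fires when a spider finishes. A rough sketch of connecting to it is below, but I still don't see how I would start the next crawl from there:

from scrapy import signals

class SomeSpider(CrawlSpider):
    # ... same attributes and parse_obj as above ...

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SomeSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        # fires once this spider's crawl is completely done; presumably
        # this is where the crawl for the next URL would have to start
        pass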

Upvotes: 2

Views: 1039

Answers (1)

masnun

Reputation: 11906

You have several options:

  • Write site-specific spiders, e.g. one for Yellow Pages and one for Yahoo!
  • Write a generic spider that takes arguments at run time to decide which site to crawl.
  • If you want to do it manually, just change the URL by hand and re-run the spider.

I assume you have a Scrapy project set up. You can run a spider from the command line like this:

scrapy crawl myspider

You will be able to see in your console when the spider finishes. If you would like a CSV export, you can do:

scrapy crawl myspider -o filename.csv -t csv

If you want to pass arguments you can do this:

from scrapy.spider import BaseSpider  # scrapy.Spider in newer Scrapy versions

class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, url='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [url]
        # ...

And then later do:

scrapy crawl myspider -a url=http://myexample.com
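If you would rather drive this from a single script than run the command once per site, one way to run the crawls strictly one after another is to chain them with CrawlerRunner and Twisted deferreds. A rough sketch, assuming the parameterised MySpider above and your two start URLs (adapt paths and settings to your project):

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl_sequentially():
    # each yield waits until the previous crawl has fully finished
    for url in ['http://www.yellowpages.com/', 'http://sports.yahoo.com']:
        yield runner.crawl(MySpider, url=url)
    reactor.stop()

crawl_sequentially()
reactor.run()

Each URL gets its own run of the spider, so the output preserves the order of the list even though the requests inside each crawl remain asynchronous.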

Upvotes: 2
