detweiller

Reputation: 157

Scrapy CrawlSpider - Start crawl on next URL only after first URL is complete

I have a spider that looks something like this:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# CrawlsItem and get_domain are defined elsewhere in my project

class SomeSpider(CrawlSpider):
    name = 'grablink'
    allowed_domains = ['www.yellowpages.com', 'sports.yahoo.com']
    start_urls = ['http://www.yellowpages.com/', 'http://sports.yahoo.com']
    rules = (Rule(LinkExtractor(allow=allowed_domains), callback='parse_obj', follow=False),)

    def parse_obj(self, response):
        # collect every link that points outside the allowed domains
        for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item = CrawlsItem()
            item['DomainName'] = get_domain(response.url)
            item['LinkToOtherDomain'] = link.url
            item['LinkFoundOn'] = response.url
            yield item

It extracts internal links from the given start URLs, follows them, and extracts all the external links from the linked pages. It works, but right now the results are not in any particular order: some rows are from yellowpages.com while others are from sports.yahoo.com. I understand this is normal Scrapy behavior because it is asynchronous, but is there any way I can make it extract all the links from yellowpages.com first, then sports.yahoo.com, and so on? Within a particular URL the crawl can be asynchronous, that doesn't matter, but the start URLs themselves should be crawled in order.

I think one way to do this would be to keep all the start URLs in a separate list, put only one URL in start_urls, and then, once the crawl for that URL has finished, start the crawl on the next one. But I don't know where I would do this. How can I tell that the crawl for one URL has completed so that I can begin the next one?
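For what it's worth, Scrapy does have a spider_closed signal that fires when a spider finishes. A rough sketch of connecting to it is below, but I still don't see how I would start the next crawl from there:

from scrapy import signals

class SomeSpider(CrawlSpider):
    # ... same attributes and parse_obj as above ...

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SomeSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        # fires once this spider's crawl is completely done; presumably
        # this is where the crawl for the next URL would have to start
        pass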

Upvotes: 2

Views: 1039

Answers (1)

masnun

Reputation: 11906

You have several options:

  • Write site-specific spiders, e.g. one for Yellow Pages and one for Yahoo!
  • Write a generic spider that takes arguments at run time to decide which site to crawl.
  • If you want to do it manually, just change the URL by hand and re-run the spider.

I assume you have a Scrapy project set up. You can run a spider from the command line like this:

scrapy crawl myspider

You will be able to see in your console when the spider finishes. If you would like a CSV export, you can do:

scrapy crawl myspider -o filename.csv -t csv

If you want to pass arguments you can do this:

from scrapy.spider import BaseSpider  # scrapy.Spider in newer Scrapy versions

class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, url='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [url]
        # ...

And then later do:

scrapy crawl myspider -a url=http://myexample.com
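If you would rather drive this from a single script than run the command once per site, one way to run the crawls strictly one after another is to chain them with CrawlerRunner and Twisted deferreds. A rough sketch, assuming the parameterised MySpider above and your two start URLs (adapt paths and settings to your project):

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl_sequentially():
    # each yield waits until the previous crawl has fully finished
    for url in ['http://www.yellowpages.com/', 'http://sports.yahoo.com']:
        yield runner.crawl(MySpider, url=url)
    reactor.stop()

crawl_sequentially()
reactor.run()

Each URL gets its own run of the spider, so the output preserves the order of the list even though the requests inside each crawl remain asynchronous.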

Upvotes: 2
