Reputation: 157
I have a spider that looks something like this:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SomeSpider(CrawlSpider):
    name = 'grablink'
    allowed_domains = ['www.yellowpages.com', 'sports.yahoo.com']
    start_urls = ['http://www.yellowpages.com/', 'http://sports.yahoo.com']
    rules = (Rule(LinkExtractor(allow=allowed_domains), callback='parse_obj', follow=False),)

    def parse_obj(self, response):
        # CrawlsItem and get_domain are defined elsewhere in my project
        for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item = CrawlsItem()
            item['DomainName'] = get_domain(response.url)
            item['LinkToOtherDomain'] = link.url
            item['LinkFoundOn'] = response.url
            yield item
It extracts internal links from the given start URLs, follows them, and extracts all the external links from the linked pages. It works okay, but right now the results are not in any specific order, meaning some rows are from yellowpages.com while others are from sports.yahoo.com. I understand that this is normal Scrapy behavior because it is asynchronous, but is there any way I can make it so that all the links are extracted from yellowpages.com first, then from sports.yahoo.com, and so on? Within a single start URL the crawl can be asynchronous, that doesn't matter, but the start URLs themselves should be crawled in order.
I think one way to do this would be to keep all the start URLs in a separate list, put only one URL in start_urls at a time, and run the crawler; once the crawl for that URL is finished, I'd start the crawl on the next one. But I don't know where I would do this. How can I know that the crawl for one URL has completed so that I can begin the next one? A rough sketch of what I mean is below.
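For what it's worth, this is roughly the idea I have in mind (just an untested sketch using CrawlerRunner, and it assumes the spider can take a single url argument instead of the hard-coded start_urls):
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl_in_order():
    # runner.crawl() returns a Deferred; yielding it waits for that crawl
    # to finish before the loop moves on to the next start URL
    for url in ['http://www.yellowpages.com/', 'http://sports.yahoo.com']:
        yield runner.crawl(SomeSpider, url=url)  # assumes SomeSpider accepts a url argument
    reactor.stop()

crawl_in_order()
reactor.run()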
Upvotes: 2
Views: 1039
Reputation: 11906
You have several options:
I assume you have a Scrapy project set up. You can run a spider from the command line like this:
scrapy crawl myspider
You will see on your console when the spider finishes. If you would like a CSV export, you can do:
scrapy crawl myspider -o filename.csv -t csv
If you want to pass arguments, you can do this:
from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, url='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [url]
        # ...
And then later do:
scrapy crawl myspider -a url=http://myexample.com
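To get the ordering you describe, one possibility (just a sketch, reusing the spider name and the URLs from your question) is to run the spider once per start URL, one command after the other. Each scrapy crawl invocation only returns when its crawl has finished, so the second domain is not touched until the first is done:
scrapy crawl myspider -a url=http://www.yellowpages.com/
scrapy crawl myspider -a url=http://sports.yahoo.com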
Upvotes: 2