Reputation: 865
I need to create a spider that crawls some data from a web site. Part of the data is an external URL.
I have already created the spider that crawls the data from the root site, and now I want to write the spider for the external web pages.
I was thinking of creating a CrawlSpider that uses the SgmlLinkExtractor to follow some specific links on each external web page.
What is the recommended way to communicate the list of start_urls to the second spider?
My idea is to generate a JSON file with the items and to read the URL attribute in the start_requests of the second spider, something like the sketch below.
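A rough sketch of that idea (the items.json path and the external_url field name are just placeholders for whatever my first spider actually exports):

import json

from scrapy.spider import BaseSpider
from scrapy.http import Request

class ExternalSpider(BaseSpider):
    name = "external"

    def start_requests(self):
        # items.json is assumed to be produced by the first spider,
        # e.g. with: scrapy crawl first_spider -o items.json -t json
        with open("items.json") as f:
            items = json.load(f)
        for item in items:
            # "external_url" is a placeholder field name
            yield Request(item["external_url"], callback=self.parse)

    def parse(self, response):
        # parse the external page here
        pass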
Upvotes: 1
Views: 1548
Reputation: 59664
I have already created the spider that crawls the data from the root site, and now I want to write the spider for the external web pages.
Save these external page URLs to a database.
What is the recommended way to communicate the list of start_urls to the second spider?
Override BaseSpider.start_requests in your other spider and create the requests from the URLs you get from the database.
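For example, a minimal sketch with sqlite (the urls.db file and the external_urls table are placeholders; the first spider is assumed to have stored one URL per row):

import sqlite3

from scrapy.spider import BaseSpider
from scrapy.http import Request

class ExternalPagesSpider(BaseSpider):
    name = "external_pages"

    def start_requests(self):
        # read the urls the first spider saved and turn them into requests
        conn = sqlite3.connect("urls.db")
        try:
            for (url,) in conn.execute("SELECT url FROM external_urls"):
                yield Request(url, callback=self.parse)
        finally:
            conn.close()

    def parse(self, response):
        # parse the external page here
        pass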
Upvotes: 2
Reputation: 2003
The question is pretty vague, but here is one way:
from scrapy.spider import BaseSpider
from scrapy.http import Request

class PracticeSpider(BaseSpider):
    name = "project_name"
    # note: the offsite middleware filters requests to domains not listed
    # here, so add the external domains (or drop this attribute entirely)
    allowed_domains = ["my_domain.org"]

    def start_requests(self):
        start_url = "The First Page URL"
        return [Request(start_url, callback=self.parse)]

    def parse(self, response):
        # parse the first page itself
        yield self.pageParser(response)
        # grab the external URLs you want to follow
        ext_urls = ...
        for url in ext_urls:
            yield Request(url, callback=self.pageParser)

    def pageParser(self, response):
        # parse the page and build the items
        return items
There is also a meta dict argument on Request that might be of some help.
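For instance, a minimal sketch of carrying data from one page to the callback of the next (the source_url key and the URLs are made up):

from scrapy.spider import BaseSpider
from scrapy.http import Request

class MetaExampleSpider(BaseSpider):
    name = "meta_example"
    start_urls = ["http://my_domain.org/"]

    def parse(self, response):
        # attach data from this page to the request for the external page
        yield Request("http://external.example.com/page",
                      callback=self.pageParser,
                      meta={"source_url": response.url})

    def pageParser(self, response):
        # the data travels with the request and is available here
        self.log("came from %s" % response.meta["source_url"])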
Upvotes: 0