Reputation: 1654
I'm trying to create an ExampleSpider that subclasses Scrapy's CrawlSpider. My ExampleSpider should be able to process pages containing only artist info, pages containing only album info, and other pages that contain both album and artist info.
I was able to handle the first two scenarios, but the problem occurs in the third. I'm using the parse_artist(response) method to process artist data and the parse_album(response) method to process album data.
My question is: if a page contains both artist and album data, how should I define my rules?
Is there another (proper) way to do it?
class ExampleSpider(CrawlSpider):
    name = 'example'
    start_urls = ['http://www.example.com']

    rules = [
        Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_artist', follow=True),
        Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_album', follow=True),
        # more rules .....
    ]

    def parse_artist(self, response):
        artist_item = ArtistItem()
        try:
            # do the scrape and assign to ArtistItem
            pass
        except Exception:
            # ignore for now
            pass
        return artist_item

    def parse_album(self, response):
        album_item = AlbumItem()
        try:
            # do the scrape and assign to AlbumItem
            pass
        except Exception:
            # ignore for now
            pass
        return album_item
Upvotes: 3
Views: 4422
Reputation: 161734
The CrawlSpider calls its _requests_to_follow() method to extract URLs and generate requests to follow:
def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        seen = seen.union(links)
        for link in links:
            r = Request(url=link.url, callback=self._response_downloaded)
            r.meta.update(rule=n, link_text=link.text)
            yield rule.process_request(r)
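To make the effect of the seen set concrete, here is a small standalone simulation of that bookkeeping. It is only a sketch and does not use Scrapy: the RULES table, the assign_callbacks() helper, and the /music/ pattern are hypothetical stand-ins for two rules that share the same regex.

```python
import re

# Hypothetical rule table: (pattern, callback name), in declaration order.
# Both rules use the same pattern, as in the question.
RULES = [
    (re.compile(r"/music/"), "parse_artist"),
    (re.compile(r"/music/"), "parse_album"),
]


def assign_callbacks(urls, rules=RULES):
    """Mimic the `seen` bookkeeping: a URL is claimed by the first matching rule only."""
    seen = set()
    assigned = {}
    for pattern, callback in rules:
        # Links already claimed by an earlier rule are filtered out here.
        links = [u for u in urls if pattern.search(u) and u not in seen]
        seen = seen.union(links)
        for u in links:
            assigned[u] = callback
    return assigned
```

Under these assumptions, every URL matching the shared pattern is assigned to parse_artist, and the second rule never receives a link.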
As you can see, the seen set remembers the links that have already been extracted, so each URL is handled by at most one callback: the one belonging to the first rule that matches it. You can define a parse_item() method that calls both parse_artist() and parse_album():
rules = [
    Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_item', follow=True),
    # more rules .....
]

def parse_item(self, response):
    yield self.parse_artist(response)
    yield self.parse_album(response)
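One possible refinement, since parse_item() as written yields an item from each parser even on pages that contain only one kind of data: skip results that came back empty. The sketch below is not part of the original answer. CombinedCallbackMixin and DemoSpider are hypothetical names, plain dicts stand in for ArtistItem and AlbumItem, and a dict stands in for the response. It assumes an unpopulated item is falsy, as an empty dict is.

```python
class CombinedCallbackMixin:
    """Hypothetical mixin: a single callback that delegates to both parsers."""

    def parse_item(self, response):
        for parser in (self.parse_artist, self.parse_album):
            item = parser(response)
            # Skip parsers that found nothing (None or an empty, dict-like item).
            if item:
                yield item


class DemoSpider(CombinedCallbackMixin):
    # Stand-ins for the real parse_artist()/parse_album(); plain dicts
    # replace ArtistItem/AlbumItem and `response` is just a dict here.
    def parse_artist(self, response):
        return {"artist": response["artist"]} if "artist" in response else {}

    def parse_album(self, response):
        return {"album": response["album"]} if "album" in response else {}
```

With these stand-ins, a page holding both kinds of data produces two items, while an album-only page produces just the album item.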
Upvotes: 9