I know there are a dozen or so questions related to this, but none that I saw really had more than one method in their spider...
So I'm scraping a website, starting with the categories page. I'm grabbing the links to the product categories, then attempting to leverage the CrawlSpider's rules to automatically iterate through the 'next' page of each category, scraping certain info within the page at each step.
The issue is that the spider only visits the first page of each category and seems to ignore the follow=True aspect of the Rule I set. So here's the code; I'd love some help:
start_urls = ["http://home.mercadolivre.com.br/mais-categorias/"]

rules = (
    # I would like this to force the spider to crawl through the pages,
    # calling the product parser each time
    Rule(LxmlLinkExtractor(allow=(),
                           restrict_xpaths='//*[@id="results-section"]/div[2]/ul/li[@class="pagination__next"]'),
         follow=True, callback='parse_product_links'),
)

def parse(self, response):
    categories = CategoriesItem()
    #categories['categoryLinks'] = []
    for link in LxmlLinkExtractor(
            allow=('(?<=http://lista.mercadolivre.com.br/delicatessen/)(?:whisky|licor|tequila|vodka|champagnes)'),
            restrict_xpaths='//body').extract_links(response):
        categories['categoryURL'] = link.url
        yield Request(link.url, meta={'categoryURL': categories['categoryURL']},
                      callback=self.parse_product_links)

# ideally this function would grab the product links from each page
def parse_product_links(self, response):
    # I have this built out in my code, but it isn't necessary here,
    # so I wanted to keep it as de-cluttered as possible
    pass
I'd appreciate any help you can give, because it appears I don't entirely understand how to link the extractor used in a Rule to the methods I want it to call (which is why I have 'parse_product_links' as a callback in two places).
Upvotes: 0
Views: 383
Reputation: 21436
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
From the CrawlSpider documentation.
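(If you do need to run code against the start_urls responses themselves, CrawlSpider exposes a parse_start_url hook for exactly that, so parse can stay untouched. A minimal sketch; the spider name and log message here are illustrative:

from scrapy.spiders import CrawlSpider

class ExampleSpider(CrawlSpider):
    name = 'example'  # illustrative name
    start_urls = ['http://home.mercadolivre.com.br/mais-categorias/']

    def parse_start_url(self, response):
        # Called by CrawlSpider for each start_urls response;
        # parse itself stays intact, so the rules still run.
        self.logger.info('start page: %s', response.url)
        return []
)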
It's really inadvisable to use CrawlSpider if you are not familiar with how Scrapy works; it's a shortcut that is very implicit and can get confusing.
In your case you override parse, which shouldn't happen, and you only have a rule for the next page. So get rid of that parse method and extend your rules to contain two rules: a rule for finding products and a rule for finding pages (with follow set to True for the latter, since you want to keep finding new pages on the pages you visit). A sketch of that shape follows.
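A minimal sketch, reusing the category regex and pagination XPath from the question. The spider name, allowed_domains, the product-link XPath, and the field yielded in the callback are illustrative assumptions, and a category-following rule is added because there is no longer a parse method to pull category links:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LxmlLinkExtractor

class MercadoSpider(CrawlSpider):
    name = 'mercado'  # illustrative name
    allowed_domains = ['mercadolivre.com.br']  # illustrative
    start_urls = ['http://home.mercadolivre.com.br/mais-categorias/']

    rules = (
        # Follow the category links from the start page
        # (regex taken from the question).
        Rule(LxmlLinkExtractor(
                 allow=(r'(?<=http://lista.mercadolivre.com.br/delicatessen/)'
                        r'(?:whisky|licor|tequila|vodka|champagnes)',)),
             follow=True),
        # Follow pagination (XPath taken from the question); follow=True
        # means the rules are applied again to each page that is found.
        Rule(LxmlLinkExtractor(
                 restrict_xpaths='//*[@id="results-section"]'
                                 '/div[2]/ul/li[@class="pagination__next"]'),
             follow=True),
        # Extract product links and hand them to the callback.
        # This XPath is a guess; point it at the real product anchors.
        Rule(LxmlLinkExtractor(
                 restrict_xpaths='//li[@class="results-item"]//a'),
             callback='parse_product_links'),
    )

    def parse_product_links(self, response):
        # Fill in real field extraction; this is a placeholder.
        yield {'productURL': response.url}

With this layout, CrawlSpider's own parse does the dispatching: every followed page is run back through the rules, so pagination keeps working without any explicit recursion on your part.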
Upvotes: 0