user7730188

Reputation:

Scrapy Crawl Spider, trouble following links

I know there are a dozen or so questions related to this, but none that I saw really had more than one method in their spider...

So I'm scraping a website, starting with the categories page. I'm grabbing the links to the product categories, then attempting to leverage the crawl spider's rules to automatically iterate through the 'next' page in each category, scraping certain info within the page at each step.

The issue is that the spider simply goes to the first page in each category and seems to ignore the follow=True part of the Rule I set. So here's the code, would love some help:

start_urls = ["http://home.mercadolivre.com.br/mais-categorias/"]

rules = (
    # I would like this to force the spider to crawl through the pages... calling the product parser each time
    Rule(LxmlLinkExtractor(allow=(),
         restrict_xpaths='//*[@id="results-section"]/div[2]/ul/li[@class="pagination__next"]'),
         follow=True, callback='parse_product_links'),
)

def parse(self, response):
    categories = CategoriesItem()
    #categories['categoryLinks'] = []
    for link in LxmlLinkExtractor(allow=('(?<=http://lista.mercadolivre.com.br/delicatessen/)(?:whisky|licor|tequila|vodka|champagnes)'), restrict_xpaths = ("//body")).extract_links(response):
        categories['categoryURL'] = link.url
        yield Request(link.url, meta={'categoryURL': categories['categoryURL']}, callback = self.parse_product_links)


# ideally this function would grab the product links from each page
def parse_product_links(self, response):
    # I have this built out in my code, but it isn't necessary so I wanted to keep it as de-cluttered as possible
    pass

Would appreciate any help you can give, because it appears as though I don't entirely understand how to link the extractor used in Rules to the methods I want to use it with (which is why I have 'parse_product_links' as a callback in two locations).

Upvotes: 0

Views: 383

Answers (1)

Granitosaurus

Reputation: 21436

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

From the CrawlSpider documentation.

It's really inadvisable to use CrawlSpider if you are not familiar with how Scrapy works. It's a shortcut that is very implicit and can get confusing.

In your case you override parse, which shouldn't happen, and you only have a rule for the next page. So get rid of that parse method and extend your rules to contain two rules: a rule for finding products and a rule for finding pages (with follow set to True for the latter, since you want to discover new pages from the pages you crawl). Something like the sketch below.
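For reference, here's a minimal sketch of what that could look like. The pagination XPath is the one from your own rule; the product-link XPath and the fields in parse_product are placeholders you'd adapt to the actual page structure:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ProductSpider(CrawlSpider):
    name = 'products'
    start_urls = ['http://home.mercadolivre.com.br/mais-categorias/']

    rules = (
        # rule for finding products: call the callback on every product link
        Rule(
            LinkExtractor(restrict_xpaths='//div[@id="results-section"]//h2/a'),  # placeholder XPath
            callback='parse_product',
        ),
        # rule for finding pages: no callback, just keep following the "next" links
        Rule(
            LinkExtractor(restrict_xpaths='//*[@id="results-section"]/div[2]/ul/li[@class="pagination__next"]'),
            follow=True,
        ),
    )

    # note: no parse() method here -- CrawlSpider needs parse for its own logic
    def parse_product(self, response):
        yield {
            'url': response.url,
            'title': response.xpath('//h1/text()').extract_first(),  # placeholder selector
        }

The pagination rule has no callback at all; its only job is to keep feeding new listing pages back into the crawl, while the product rule does the actual scraping.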

Upvotes: 0
