I know there are a dozen or so questions related to this, but none that I saw really had more than one method in their spider...
So I'm scraping a website, starting with the categories page. I'm grabbing the links to the product categories, then attempting to leverage the CrawlSpider's rules to automatically iterate through the 'next' page of each category, scraping certain info within the page at each step.
The issue is that the spider only visits the first page of each category and seems to ignore the follow=True aspect of the Rule I set. So here's the code; I'd love some help:
start_urls = ["http://home.mercadolivre.com.br/mais-categorias/"]

rules = (
    # I would like this to force the spider to crawl through the pages,
    # calling the product parser each time
    Rule(LxmlLinkExtractor(allow=(),
                           restrict_xpaths='//*[@id="results-section"]/div[2]/ul/li[@class="pagination__next"]'),
         follow=True, callback='parse_product_links'),
)

def parse(self, response):
    categories = CategoriesItem()
    #categories['categoryLinks'] = []
    for link in LxmlLinkExtractor(
            allow=('(?<=http://lista.mercadolivre.com.br/delicatessen/)(?:whisky|licor|tequila|vodka|champagnes)'),
            restrict_xpaths='//body').extract_links(response):
        categories['categoryURL'] = link.url
        yield Request(link.url, meta={'categoryURL': categories['categoryURL']},
                      callback=self.parse_product_links)

# ideally this function would grab the product links from each page
def parse_product_links(self, response):
    # I have this built out in my code, but it isn't necessary here,
    # so I wanted to keep it as de-cluttered as possible
    pass
I'd appreciate any help you can give, because it appears I don't entirely understand how to link the extractor used in a Rule to the methods I want it to call (which is why I have 'parse_product_links' as a callback in two places).
Upvotes: 0
Views: 383
Reputation: 21436
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
From the CrawlSpider documentation.
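(If you do need to run code against the start_urls responses themselves, CrawlSpider exposes a parse_start_url hook for exactly that, so parse can stay untouched. A minimal sketch; the spider name and log message here are illustrative:

from scrapy.spiders import CrawlSpider

class ExampleSpider(CrawlSpider):
    name = 'example'  # illustrative name
    start_urls = ['http://home.mercadolivre.com.br/mais-categorias/']

    def parse_start_url(self, response):
        # Called by CrawlSpider for each start_urls response;
        # parse itself stays intact, so the rules still run.
        self.logger.info('start page: %s', response.url)
        return []
)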
It's really inadvisable to use CrawlSpider if you are not familiar with how Scrapy works; it's a shortcut that is very implicit and can get confusing.
In your case you override parse, which shouldn't happen, and you only have a rule for the next page. So get rid of that parse method and extend your rules to contain two rules: a rule for finding products and a rule for finding pages (with follow set to True for the latter, since you want to keep finding new pages on the pages you visit). A sketch of that shape follows.
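A minimal sketch, reusing the category regex and pagination XPath from the question. The spider name, allowed_domains, the product-link XPath, and the field yielded in the callback are illustrative assumptions, and a category-following rule is added because there is no longer a parse method to pull category links:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LxmlLinkExtractor

class MercadoSpider(CrawlSpider):
    name = 'mercado'  # illustrative name
    allowed_domains = ['mercadolivre.com.br']  # illustrative
    start_urls = ['http://home.mercadolivre.com.br/mais-categorias/']

    rules = (
        # Follow the category links from the start page
        # (regex taken from the question).
        Rule(LxmlLinkExtractor(
                 allow=(r'(?<=http://lista.mercadolivre.com.br/delicatessen/)'
                        r'(?:whisky|licor|tequila|vodka|champagnes)',)),
             follow=True),
        # Follow pagination (XPath taken from the question); follow=True
        # means the rules are applied again to each page that is found.
        Rule(LxmlLinkExtractor(
                 restrict_xpaths='//*[@id="results-section"]'
                                 '/div[2]/ul/li[@class="pagination__next"]'),
             follow=True),
        # Extract product links and hand them to the callback.
        # This XPath is a guess; point it at the real product anchors.
        Rule(LxmlLinkExtractor(
                 restrict_xpaths='//li[@class="results-item"]//a'),
             callback='parse_product_links'),
    )

    def parse_product_links(self, response):
        # Fill in real field extraction; this is a placeholder.
        yield {'productURL': response.url}

With this layout, CrawlSpider's own parse does the dispatching: every followed page is run back through the rules, so pagination keeps working without any explicit recursion on your part.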
Upvotes: 0