AlejandroVK

Reputation: 7605

CrawlSpider not following links

Title says it all: I'm trying to make a CrawlSpider work for some products on Amazon, to no avail.

Here is the original URL of the page I want to get products from.

Looking at the HTML where the "Next Page" link sits, it looks like this:

<a title="Next Page" id="pagnNextLink" class="pagnNext" href="/s/ref=sr_pg_2?me=A1COIXT69Y8KR&amp;rh=i%3Amerchant-items&amp;page=2&amp;ie=UTF8&amp;qid=1444414650">
    <span id="pagnNextString">Next Page</span>
    <span class="srSprite pagnNextArrow"></span>
</a>

This is the regular expression I'm currently using:

s/ref=sr_pg_[0-9]\?[^">]+

Using a service like Pythex.org, this seems to be OK; I'm getting this portion of the URL:

s/ref=sr_pg_2?me=A1COIXT69Y8KR&amp;rh=i%3Amerchant-items&amp;page=2&amp;ie=UTF8&amp;qid=1444414650
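For what it's worth, the same check can be reproduced locally with Python's re module (which is essentially what Pythex runs; the html string below is just the anchor tag copied from the browser):

import re

# The href exactly as it appears in the browser's view of the page
html = ('<a title="Next Page" id="pagnNextLink" class="pagnNext" '
        'href="/s/ref=sr_pg_2?me=A1COIXT69Y8KR&amp;rh=i%3Amerchant-items'
        '&amp;page=2&amp;ie=UTF8&amp;qid=1444414650">')

print(re.search(r's/ref=sr_pg_[0-9]\?[^">]+', html).group(0))
# -> s/ref=sr_pg_2?me=A1COIXT69Y8KR&amp;rh=i%3Amerchant-items&amp;page=2&amp;ie=UTF8&amp;qid=1444414650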

Here is the code of my crawler:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from amazon.items import AmazonProduct


class AmazonCrawlerSpider(CrawlSpider):
    name = 'amazon_crawler'
    allowed_domains = ['amazon.com']
    #allowed_domains = ['stackoverflow.com']
    start_urls = ['http://www.amazon.com/s?ie=UTF8&me=A19COJAJDNQSRP&page=1']
    #start_urls = ['http://stackoverflow.com/questions?pagesize=50&sort=newest']
    rules = [
        Rule(LinkExtractor(allow=r's/ref=sr_pg_[0-9]\?[^">]+'),
             callback='parse_item', follow=True)
    ]
    '''rules = [
        Rule(LinkExtractor(allow=r'questions\?page=[0-9]&sort=newest'),
             callback='parse_item', follow=True)
    ]'''


    def parse_item(self, response):
        products = response.xpath('//div[@class="summary"]/h3')
        for product in products:
            item = AmazonProduct()
            print('found it!')
            yield item

For some unknown reason, the crawler is not following the links. The code is based on the blog tutorial from the guys at RealPython, where they crawl StackOverflow for questions. In fact, just swap in the commented-out StackOverflow lines to see that that version works.

Any idea what I'm missing here? Thanks!

UPDATE:

Based on the answer from @Rejected, I switched to the Scrapy shell and could see that, as he pointed out, the HTML Scrapy receives is different from the one I see in the browser.

The interesting bit of the HTML Scrapy is actually getting is:

<a title="Next Page" id="pagnNextLink" class="pagnNext" href="/s?ie=UTF8&me=A19COJAJDNQSRP&page=2">
    <span id="pagnNextString">Next Page</span>
    <span class="srSprite pagnNextArrow"></span>
</a>

I've changed my regular expression so it looks like this:

s[^">&]+&me=A19COJAJDNQSRP&page=[0-9]$
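To double-check it, I ran the extractor by hand in the shell, roughly like this:

# Inside `scrapy shell 'http://www.amazon.com/s?ie=UTF8&me=A19COJAJDNQSRP&page=1'`
from scrapy.linkextractors import LinkExtractor
LinkExtractor(allow=r's[^">&]+&me=A19COJAJDNQSRP&page=[0-9]$').extract_links(response)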

Now I'm getting the links in the shell:

[Link(url='http://www.amazon.com/s?ie=UTF8&me=A19COJAJDNQSRP&page=1', text='\n    \n        \n            \n            \n            \n            \n    \n    ', fragment='', nofollow=False), Link(url='http://www.amazon.com/s?ie=UTF8&me=A19COJAJDNQSRP&page=2', text='2', fragment='', nofollow=False), Link(url='http://www.amazon.com/s?ie=UTF8&me=A19COJAJDNQSRP&page=3', text='3', fragment='', nofollow=False)]

And the crawler is now following them correctly too!

Upvotes: 1

Views: 162

Answers (1)

Rejected

Reputation: 4491

Scrapy is being served different HTML data than what you are seeing in your browser (even compared to simply requesting "view-source:url").

Why, I wasn't able to determine with 100% certainty. The three(?) desired links will match r's/ref=sr_pg_[0-9]' in your allow pattern.

Since Amazon is doing something to detect the browser, you should test what you're getting in your instance of Scrapy too. Drop the URL into the shell and play around with the LinkExtractor yourself via the following:

LinkExtractor(allow=r's/ref=sr_pg_[0-9]').extract_links(response)
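For instance, with the start URL from the question, a quick session might look like this (view(response) is a shell shortcut that opens the HTML Scrapy actually received in your browser, which makes any difference easy to spot):

# From the command line:
#   scrapy shell 'http://www.amazon.com/s?ie=UTF8&me=A19COJAJDNQSRP&page=1'
# Then, inside the shell:
from scrapy.linkextractors import LinkExtractor
LinkExtractor(allow=r's/ref=sr_pg_[0-9]').extract_links(response)
view(response)  # compare against what your browser shows for the same URL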

Upvotes: 2
