Reputation: 7605
Title says it all: I'm trying to make a CrawlSpider work for some products on Amazon, to no avail.
Here is the original page I want to get products from.
Looking at the HTML code, the "Next Page" link looks like this:
<a title="Next Page" id="pagnNextLink" class="pagnNext" href="/s/ref=sr_pg_2?me=A1COIXT69Y8KR&rh=i%3Amerchant-items&page=2&ie=UTF8&qid=1444414650">
<span id="pagnNextString">Next Page</span>
<span class="srSprite pagnNextArrow"></span>
</a>
This is the current regular expression I'm using:
s/ref=sr_pg_[0-9]\?[^">]+
Using a service like Pythex.org, this seems to be OK; I'm matching this portion of the URL:
s/ref=sr_pg_2?me=A1COIXT69Y8KR&rh=i%3Amerchant-items&page=2&ie=UTF8&qid=1444414650
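(Just as a sanity check, and not part of the spider: the same pattern can be verified locally with Python's re module, using the href copied from the snippet above.)

import re

# href taken from the "Next Page" link shown above
href = '/s/ref=sr_pg_2?me=A1COIXT69Y8KR&rh=i%3Amerchant-items&page=2&ie=UTF8&qid=1444414650'
pattern = r's/ref=sr_pg_[0-9]\?[^">]+'

match = re.search(pattern, href)
print(match.group(0) if match else 'no match')
# prints: s/ref=sr_pg_2?me=A1COIXT69Y8KR&rh=i%3Amerchant-items&page=2&ie=UTF8&qid=1444414650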
Here is the code of my crawler:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from amazon.items import AmazonProduct


class AmazonCrawlerSpider(CrawlSpider):
    name = 'amazon_crawler'
    allowed_domains = ['amazon.com']
    #allowed_domains = ['stackoverflow.com']
    start_urls = ['http://www.amazon.com/s?ie=UTF8&me=A19COJAJDNQSRP&page=1']
    #start_urls = ['http://stackoverflow.com/questions?pagesize=50&sort=newest']

    # Follow pagination links and parse each page of results
    rules = [
        Rule(LinkExtractor(allow=r's/ref=sr_pg_[0-9]\?[^">]+'),
             callback='parse_item', follow=True)
    ]
    '''rules = [
        Rule(LinkExtractor(allow=r'questions\?page=[0-9]&sort=newest'),
             callback='parse_item', follow=True)
    ]'''

    def parse_item(self, response):
        products = response.xpath('//div[@class="summary"]/h3')
        for product in products:
            item = AmazonProduct()
            print('found it!')
            yield item
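For reference, I'm running the spider the usual way (given name = 'amazon_crawler' in the code above):

scrapy crawl amazon_crawler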
For some unknown reason, the crawler is not following the links. This code is based on the blog tutorial from the guys at Real Python, where they crawl StackOverflow for questions. In fact, just uncomment the commented lines (and comment out the Amazon ones) to see that that version works.
Any idea what I'm missing here? Thanks!
UPDATE:
Based on the answer from @Rejected, I dropped into the Scrapy shell and could see that, as he pointed out, the HTML Scrapy receives is different from the one I see in the browser.
The interesting bit of what Scrapy actually gets is:
<a title="Next Page" id="pagnNextLink" class="pagnNext" href="/s?ie=UTF8&me=A19COJAJDNQSRP&page=2">
<span id="pagnNextString">Next Page</span>
<span class="srSprite pagnNextArrow"></span>
</a>
I've changed my regular expression so it now looks like this:
s[^">&]+&me=A19COJAJDNQSRP&page=[0-9]$
Now I'm getting the links in the shell:
[Link(url='http://www.amazon.com/s?ie=UTF8&me=A19COJAJDNQSRP&page=1', text='\n \n \n \n \n \n \n \n ', fragment='', nofollow=False), Link(url='http://www.amazon.com/s?ie=UTF8&me=A19COJAJDNQSRP&page=2', text='2', fragment='', nofollow=False), Link(url='http://www.amazon.com/s?ie=UTF8&me=A19COJAJDNQSRP&page=3', text='3', fragment='', nofollow=False)]
And the crawler is now following them correctly!
Upvotes: 1
Views: 162
Reputation: 4491
Scrapy is being served different HTML than what you are seeing in your browser (even when just requesting "view-source:url").
Why, I wasn't able to determine with 100% certainty. The desired three(?) links will match r's/ref=sr_pg_[0-9]' in your allow pattern.
Since Amazon is doing something to detect the browser, you should test what you're getting in your instance of Scrapy, too. Drop the URL into scrapy shell and play around with the LinkExtractor yourself via the following:
LinkExtractor(allow=r's/ref=sr_pg_[0-9]').extract_links(response)
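If the difference comes down to user-agent sniffing (an assumption; I couldn't confirm what Amazon keys on), you could also try making Scrapy identify itself as a regular browser via the USER_AGENT setting in settings.py:

# settings.py
# Assumption: Amazon may vary the returned HTML based on the User-Agent,
# so mimicking a desktop browser might make Scrapy see the browser version.
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'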
Upvotes: 2