Reputation: 802
I have recently returned to some Scrapy code I made a few months ago.
The objective of the code was to scrape some Amazon products for data. It worked like this:
Let's take this page as an example.
The code enters every product on that page and gets data from it; after it finishes scraping all the data from that page, it moves to the next one (page 2 in this case).
That last part stopped working.
I have something like this in the rules (I had to rewrite some of the XPaths because they were outdated):
import scrapy
import re
import string
import random
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapyJuan.items import GenericItem
from scrapy.exceptions import CloseSpider
from scrapy.http import Request

class GenericScraperSpider(CrawlSpider):

    name = "generic_spider"

    # Allowed domain
    allowed_domain = ['www.amazon.com']

    search_url = 'https://www.amazon.com/s?field-keywords={}'

    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'GenericProducts.csv'
    }

    rules = {
        # Next button
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//li[@class="a-last"]/a/@href'))),
        # Every element of the page
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[contains(@class, "a-link-normal") and contains(@class,"a-text-normal")]')),
             callback='parse_item', follow=False)
    }

    def start_requests(self):
        txtfile = open('productosGenericosABuscar.txt', 'r')
        keywords = txtfile.readlines()
        txtfile.close()
        for keyword in keywords:
            yield Request(self.search_url.format(keyword))

    def parse_item(self, response):
This worked about a month ago, but I can't make it work now.
Any ideas on what's wrong?
Upvotes: 0
Views: 105
Reputation: 675
Amazon has an antibot mechanism that requests a captcha after some iterations. You can confirm this by checking the returned HTTP code; if it's waiting for a captcha you should receive something like 503 Service Unavailable. I don't see anything wrong in your code snippet (apart from {} instead of () on rules, which actually isn't affecting the results, since you can still iterate over it).
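One way to act on that advice is to inspect the response status (and body) before parsing. The helper below is a hypothetical sketch, not part of Scrapy; the 503 status and the "captcha" marker text are assumptions about how the block usually shows up:

```python
def looks_blocked(status, body):
    """Guess whether a response is Amazon's antibot page.

    Hypothetical helper: the 503 status and the "captcha" marker
    are assumptions, not a documented contract.
    """
    if status == 503:  # Service Unavailable: typical antibot answer
        return True
    # The captcha form can also come back with a 200, so check the body too
    return "captcha" in body.lower()
```

Inside parse_item you could call looks_blocked(response.status, response.text) and, for example, raise CloseSpider (already imported in your snippet) instead of emitting empty items.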
Furthermore, make sure your spider inherits from CrawlSpider and not scrapy.Spider.
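To illustrate the {} vs () point: braces around comma-separated values build a set, which is iterable (so the spider still runs) but unordered, while a tuple keeps the order the rules were written in. A plain-Python sketch, using placeholder strings instead of real Rule objects:

```python
# Placeholder strings stand in for scrapy Rule objects.
rules_as_set = {"next_page_rule", "item_rule"}    # {} -> set, unordered
rules_as_tuple = ("next_page_rule", "item_rule")  # () -> tuple, ordered

# Both contain the same rules and both are iterable,
# which is why the spider still works with {}.
assert set(rules_as_tuple) == rules_as_set
print(type(rules_as_set).__name__, type(rules_as_tuple).__name__)
# prints: set tuple
```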
Upvotes: 2