Reputation: 25
I am trying to scrape product information from Amazon and have run into a problem. When the spider reaches the end of the first results page it stops, and I want to add a way for my program to generically follow the next 3 pages of results. I am trying to edit start_urls, but I cannot do this from inside the parse function. Also, this is not a big deal, but the program is asking for the same information twice for some reason. Thanks in advance.
import scrapy
from scrapy import Spider
from scrapy import Request

class ProductSpider(scrapy.Spider):
    product = input("What product are you looking for? Keywords help for specific products: ")

    name = "Product_spider"
    allowed_domains = ['www.amazon.ca']
    start_urls = ['https://www.amazon.ca/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=' + product]
    # so that websites will not block access to the spider
    download_delay = 30

    def parse(self, response):
        temp_url_list = []
        for i in range(3, 6):
            next_url = response.xpath('//*[@id="pagn"]/span[' + str(i) + ']/a/@href').extract()
            next_url_final = response.urljoin(str(next_url[0]))
            start_urls.append(str(next_url_final))
        # xpath is similar to an address that is used to find certain elements in HTML code, this info is then extracted
        product_title = response.xpath('//*/div/div/div/div[2]/div[1]/div[1]/a/@title').extract()
        product_price = response.xpath('//span[contains(@class,"s-price")]/text()').extract()
        product_url = response.xpath('//*/div/div/div/div[2]/div[1]/div[1]/a/@href').extract()
        # yield goes through everything once, saves its spot, does not save info but sends it to the pipeline to get processed if need be
        yield {'product_title': product_title, 'product_price': product_price, 'url': product_url}
        # repeating the same process on consecutive pages
        # it is checking the same url, no generality, need to fix; maybe just do like 5 pages, also see if you can have it sort from high to low and find matches with a certain number of keywords
Upvotes: 1
Views: 2374
Reputation: 1219
You can override the __init__ method and simply pass the URLs with the -a option. Please see Spider arguments in the Scrapy documentation.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def __init__(self, urls='', *args, **kwargs):
        self.start_urls = urls.split(',')
        super(QuotesSpider, self).__init__(*args, **kwargs)
Run it like this:
scrapy crawl quotes -a "urls=http://quotes.toscrape.com/page/1/,http://quotes.toscrape.com/page/2/"
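Applied to your spider, a minimal sketch might look like the following. The -a argument name product and the way the keyword is folded into the search URL are my assumptions here, not part of your original code; the point is just that the URL is built in __init__ from a command-line argument instead of calling input() at class level.

import scrapy

class ProductSpider(scrapy.Spider):
    name = "Product_spider"
    allowed_domains = ['www.amazon.ca']
    download_delay = 30

    def __init__(self, product='', *args, **kwargs):
        super(ProductSpider, self).__init__(*args, **kwargs)
        # build start_urls from the command-line argument instead of prompting with input()
        self.start_urls = [
            'https://www.amazon.ca/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=' + product
        ]

    def parse(self, response):
        # ... your existing extraction logic goes here ...
        pass

You would then run it as:

scrapy crawl Product_spider -a "product=laptop stand"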
Upvotes: 3
Reputation: 21406
You are misunderstanding how Scrapy works here.
Scrapy expects your spider to generate (yield) either scrapy.Request objects or scrapy.Item/dictionary objects.
When your spider starts, it takes the urls from start_urls and yields a scrapy.Request for every one of them:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url)
So changing start_urls won't change anything once the spider has started.
What you can do, however, is simply yield more scrapy.Request objects in your parse() method:
def parse(self, response):
    urls = response.xpath('//a/@href').extract()
    for url in urls:
        yield scrapy.Request(url, self.parse2)

def parse2(self, response):
    # new urls!
    pass
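For your case, a rough sketch of the same idea, reusing the pagination xpath and page range from your question (which may need adjusting) and calling back into parse so the next pages are scraped the same way:

def parse(self, response):
    # extract the items on the current page first
    product_title = response.xpath('//*/div/div/div/div[2]/div[1]/div[1]/a/@title').extract()
    product_price = response.xpath('//span[contains(@class,"s-price")]/text()').extract()
    product_url = response.xpath('//*/div/div/div/div[2]/div[1]/div[1]/a/@href').extract()
    yield {'product_title': product_title, 'product_price': product_price, 'url': product_url}

    # then queue the next pages instead of touching start_urls
    for i in range(3, 6):
        next_url = response.xpath('//*[@id="pagn"]/span[' + str(i) + ']/a/@href').extract_first()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)

Any duplicate page requests generated from later pages are dropped by Scrapy's default duplicate filter, so reusing parse as the callback is fine here.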
Upvotes: 6