Reputation: 59
I want to extract data from Amazon.
This is my source code:
from scrapy.contrib.spiders import CrawlSpider
from scrapy import Selector
from selenium import webdriver
from selenium.webdriver.support.select import Select
from time import sleep
import selenium.webdriver.support.ui as ui
from scrapy.xlib.pydispatch import dispatcher
from scrapy.http import HtmlResponse, TextResponse
from extraction.items import ProduitItem


class RunnerSpider(CrawlSpider):
    name = 'products'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        items = []
        sel = Selector(response)
        self.driver.get(response.url)
        recherche = self.driver.find_element_by_xpath('//*[@id="twotabsearchtextbox"]')
        recherche.send_keys("A")
        recherche.submit()
        resultat = self.driver.find_element_by_xpath('//ul[@id="s-results-list-atf"]')
        resultas = resultat.find_elements_by_xpath('//li')
        for result in resultas:
            item = ProduitItem()
            lien = result.find_element_by_xpath('//div[@class="s-item-container"]/div/div/div[2]/div[1]/a')
            lien.click()
            #lien.implicitly_wait(2)
            res = self.driver.find_element_by_xpath('//h1[@id="aiv-content-title"]')
            item['TITRE'] = res.text
            item['IMAGE'] = lien.find_element_by_xpath('//div[@id="dv-dp-left-content"]/div[1]/div/div/img').get_attribute('src')
            items.append(item)
        self.driver.close()
        yield items
When I run my code I get this error:
Element not found in the cache - perhaps the page has changed since it was looked up Stacktrace:
Upvotes: 0
Views: 464
Reputation: 3691
If you tell Selenium to click on a link, you are moved from the original page to the page behind the link.
In your case you have a result page with links to products on Amazon. When you click one of the links in this result list, the browser moves to the detail page. The page has changed, so the remaining elements you want to iterate over in your for loop are no longer there, and the references Selenium holds to them become invalid. That is why you get the exception.
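If the detail pages really have to be visited, one common way around this is to copy plain strings (the link URLs) out of the result list first and only then navigate. The sketch below is an alternative to the suggestion that follows, not something taken from the question: the function name and the two XPath arguments are hypothetical, and the driver is assumed to offer the same Selenium 3 style find_element_by_xpath / find_elements_by_xpath calls the question uses.

```python
def scrape_detail_titles(driver, link_xpath, title_xpath):
    """Visit every linked page and return a list of (url, title) pairs.

    Assumes `driver` offers the Selenium 3 style methods used in the
    question: get(), find_elements_by_xpath(), find_element_by_xpath().
    """
    # Copy the URLs out as plain strings while the result page is still
    # loaded. Strings stay valid after navigation; element handles do not.
    urls = [link.get_attribute('href')
            for link in driver.find_elements_by_xpath(link_xpath)]

    results = []
    for url in urls:
        # Navigating away invalidates every handle from the result page,
        # which is fine here because only the URL strings are still needed.
        driver.get(url)
        title = driver.find_element_by_xpath(title_xpath).text
        results.append((url, title))
    return results
```

The loop never touches an element that was looked up on a previous page, so there is nothing left that can go stale.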
Why don't you use the search result page to extract the title and the image? Both are there; you would only need to change the XPath expressions to get the right fields of your lien element.
Update
To get the title from the search result page, extract the text of the h2 element inside the a element you want to click.
To get the image you need to take the other div in the li element: where your XPath selects div[2], select div[1] instead.
If you open the search result page in the browser and inspect the source with the developer tools, you can see which XPath expressions to use for the elements.
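As a rough illustration of reading both fields from the result list itself, here is a small sketch that runs on a made-up, heavily simplified piece of markup using only the standard library. The SAMPLE structure, the example.com URLs and the function name are assumptions for the demo; the real Amazon markup is different and changes over time, so the paths have to be taken from the developer tools. One related point for the Selenium version: an XPath that starts with // searches the whole document even when it is called on an element, so per-result lookups should start with a dot (for example .//h2).

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a search result list: in each <li>, the first
# inner div holds the image and the second holds the link with the title.
SAMPLE = """
<ul id="s-results-list-atf">
  <li>
    <div class="s-item-container">
      <div><img src="http://example.com/a.jpg"/></div>
      <div><a href="/dp/1"><h2>Product A</h2></a></div>
    </div>
  </li>
  <li>
    <div class="s-item-container">
      <div><img src="http://example.com/b.jpg"/></div>
      <div><a href="/dp/2"><h2>Product B</h2></a></div>
    </div>
  </li>
</ul>
"""


def extract_results(markup):
    """Return one dict per <li>, read without leaving the result list."""
    root = ET.fromstring(markup)
    items = []
    for li in root.findall('li'):
        # Both paths are evaluated relative to the current <li>, so each
        # lookup stays inside its own result entry.
        title = li.find("div[@class='s-item-container']/div[2]/a/h2")
        image = li.find("div[@class='s-item-container']/div[1]/img")
        items.append({'TITRE': title.text, 'IMAGE': image.get('src')})
    return items


for item in extract_results(SAMPLE):
    print(item['TITRE'], item['IMAGE'])
```

Because nothing is clicked, the result page never changes and every element stays valid for the whole loop.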
Upvotes: 1