Reputation: 809
I'm trying to use Scrapy with Selenium so that I can interact with JavaScript while keeping the powerful scraping framework that Scrapy offers. I've written a script that visits http://www.iens.nl, enters "Amsterdam" in the search bar, and then successfully clicks the search button. After clicking the search button, I want Scrapy to retrieve an element from the newly rendered page. Unfortunately, Scrapy doesn't return any values.
This is what my code looks like:
from selenium import webdriver
from scrapy.loader import ItemLoader
from scrapy import Request
from scrapy.crawler import CrawlerProcess
from properties import PropertiesItem
import scrapy


class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]
    # Start on a property page
    start_urls = ['http://www.iens.nl']

    def __init__(self):
        chrome_path = '/Users/username/Documents/chromedriver'
        self.driver = webdriver.Chrome(chrome_path)

    def parse(self, response):
        self.driver.get(response.url)
        text_box = self.driver.find_element_by_xpath('//*[@id="searchText"]')
        submit_button = self.driver.find_element_by_xpath('//*[@id="button_search"]')
        text_box.send_keys("Amsterdam")
        submit_button.click()

        l = ItemLoader(item=PropertiesItem(), response=response)
        l.add_xpath('description', '//*[@id="results"]/ul/li[1]/div[2]/h3/a/')
        return l.load_item()


process = CrawlerProcess()
process.crawl(BasicSpider)
process.start()
"properties" is another script that looks like this:
from scrapy.item import Item, Field


class PropertiesItem(Item):
    # Primary fields
    description = Field()
Q: How do I successfully make Scrapy find the element I call "description" by its XPath on the page Selenium reached and return it as output?
Thanks in advance!
Upvotes: 2
Views: 7151
Reputation: 18799
The response object you are assigning to your ItemLoader is the Scrapy response, not Selenium's.
I would recommend creating a new Selector with the page source returned by Selenium:
from scrapy import Selector
...
# Inside parse(), after submit_button.click():
selenium_response_text = self.driver.page_source
new_selector = Selector(text=selenium_response_text)
l = ItemLoader(item=PropertiesItem(), selector=new_selector)
...
That way, add_xpath will extract data from the page Selenium rendered instead of from the Scrapy response (which you don't actually need here).
Upvotes: 5