Ray

Reputation: 245

Scrapy & Selenium

I'm trying to scrape a single page using Scrapy and Selenium:

import time
import scrapy
from selenium import webdriver

class SampleSpider(scrapy.Spider):
    name = "sample"
    start_urls = ['url-to-scrape']

    def __init__(self):
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)
        time.sleep(30)
        for page in response.css('a'):
            yield {
                'url-href': page.xpath('@href').extract(),
                'url-text': page.css('::text').extract()
            }
        self.driver.quit()

The spider doesn't capture the known tags; instead it outputs:

{"url-text": [" "], "url-href": ["javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(\"ctl00$PlaceHolderMain$ctl01$ctl00$ctl01\", \"\", true, \"\", \"\", false, true))"]},
{"url-text": [" "], "url-href": ["javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(\"ctl00$PlaceHolderMain$ctl01$ctl00$ctl02\", \"\", true, \"\", \"\", false, true))"]},
{"url-text": [" "], "url-href": ["javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(\"ctl00$PlaceHolderMain$ctl01$ctl00$ctl03\", \"\", true, \"\", \"\", false, true))"]}

Thoughts?

Upvotes: 0

Views: 1701

Answers (1)

Tarun Lalwani

Reputation: 146510

You are running your selectors against the response that Scrapy downloaded, while the page you actually want is the one rendered in the Selenium browser, so this won't work. Take the page source from Selenium and create a Scrapy response object from it:

import scrapy
from selenium import webdriver

class SampleSpider(scrapy.Spider):
    name = "sample"
    start_urls = ['url-to-scrape']

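    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own setup before starting the browser.
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Chrome()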

    def parse(self, response):
        self.driver.get(response.url)
        # Swap in the browser-rendered HTML so the selectors below see it.
        res = response.replace(body=self.driver.page_source)

        for page in res.css('a'):
            yield {
                'url-href': page.xpath('@href').extract(),
                'url-text': page.css('::text').extract()
            }
        self.driver.quit()

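Also, time.sleep is not needed in this case: with the default page load strategy, driver.get() returns only after the page has finished loading.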

Upvotes: 3
