Reputation: 245
I'm trying to scrape a single page using Scrapy and Selenium:
import time
import scrapy
from selenium import webdriver


class SampleSpider(scrapy.Spider):
    name = "sample"
    start_urls = ['url-to-scrape']

    def __init__(self):
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)
        time.sleep(30)
        for page in response.css('a'):
            yield {
                'url-href': page.xpath('@href').extract(),
                'url-text': page.css('::text').extract()
            }
        self.driver.quit()
The spider doesn't capture the tags I know are on the page and outputs:
{"url-text": [" "], "url-href": ["javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(\"ctl00$PlaceHolderMain$ctl01$ctl00$ctl01\", \"\", true, \"\", \"\", false, true))"]},
{"url-text": [" "], "url-href": ["javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(\"ctl00$PlaceHolderMain$ctl01$ctl00$ctl02\", \"\", true, \"\", \"\", false, true))"]},
{"url-text": [" "], "url-href": ["javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(\"ctl00$PlaceHolderMain$ctl01$ctl00$ctl03\", \"\", true, \"\", \"\", false, true))"]}
Thoughts?
Upvotes: 0
Views: 1701
Reputation: 146510
You are reading the original Scrapy response and running your selectors on it, while the rendered page lives in the Selenium driver, so that won't work. You need to take the page source from Selenium and create a Scrapy response object from it.
import scrapy
from selenium import webdriver


class SampleSpider(scrapy.Spider):
    name = "sample"
    start_urls = ['url-to-scrape']

    def __init__(self):
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)
        # Build a new Scrapy response whose body is the Selenium-rendered HTML
        res = response.replace(body=self.driver.page_source)
        for page in res.css('a'):
            yield {
                'url-href': page.xpath('@href').extract(),
                'url-text': page.css('::text').extract()
            }
        self.driver.quit()
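One design note: calling self.driver.quit() at the end of parse works here because the spider only visits one page. If you add more URLs, it is safer to close the browser when the spider itself shuts down. A minimal sketch using Scrapy's closed() hook (the spider body is the same as above, only the quit call moves):

class SampleSpider(scrapy.Spider):
    name = "sample"
    start_urls = ['url-to-scrape']

    def __init__(self):
        self.driver = webdriver.Chrome()

    # parse() unchanged from above, minus the final self.driver.quit()

    def closed(self, reason):
        # Scrapy calls this once when the spider finishes,
        # so the browser is shut down exactly once
        self.driver.quit()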
Also, time.sleep is not needed in this case.
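If the page does need time to finish rendering before you grab page_source, an explicit wait is usually a better option than a fixed sleep. A minimal standalone sketch (the 10-second timeout and the 'a' selector are placeholder assumptions; pick an element you know only appears after the JavaScript has run, and the same wait can go right before response.replace(...) in parse):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('url-to-scrape')
# Wait at most 10 seconds for at least one <a> element to appear,
# instead of always pausing for a fixed 30 seconds
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'a'))
)
html = driver.page_source
driver.quit()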
Upvotes: 3