Reputation: 782
I'm using Scrapy to scrape data from this site. I need to call getlink from parse. A normal call does not work, and when I use yield I get this error:
2015-11-16 10:12:34 [scrapy] ERROR: Spider must return Request, BaseItem, dict or None, got 'generator' in <GET https://www.coldwellbankerhomes.com/fl/miami-dade-county/kvc-17_1,17_3,17_2,17_8/incl-22/>
Returning the getlink call from parse works, but I need to execute some code even after returning. I'm confused; any help would be really appreciated.
# -*- coding: utf-8 -*-
from scrapy.spiders import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request, Response
import re
import csv
import time
from selenium import webdriver

class ColdWellSpider(BaseSpider):
    name = "cwspider"
    allowed_domains = ["coldwellbankerhomes.com"]
    #start_urls = [''.join(row).strip() for row in csv.reader(open("remaining_links.csv"))]
    #start_urls = ['https://www.coldwellbankerhomes.com/fl/boynton-beach/5451-verona-drive-unit-d/pid_9266204/']
    start_urls = ['https://www.coldwellbankerhomes.com/fl/miami-dade-county/kvc-17_1,17_3,17_2,17_8/incl-22/']

    def parse(self, response):
        #browser = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--load-images=false'])
        browser = webdriver.Firefox()
        browser.maximize_window()
        browser.get(response.url)
        time.sleep(5)

        # to extract all the links from a page and send requests to those links
        # this works, but even after returning I need to execute the while loop
        return self.getlink(response)

        # for clicking the "load more" button on the page
        while True:
            try:
                browser.find_element_by_class_name('search-results-load-more').find_element_by_tag_name('a').click()
                time.sleep(3)
                self.getlink(response)
            except:
                break

    def getlink(self, response):
        print 'hello'
        c = open('data_getlink.csv', 'a')
        d = csv.writer(c, lineterminator='\n')
        print 'hello2'
        listclass = response.xpath('//div[@class="list-items"]/div[contains(@id,"snapshot")]')
        for l in listclass:
            link = 'http://www.coldwellbankerhomes.com/' + ''.join(l.xpath('./h2/a/@href').extract())
            d.writerow([link])
            yield Request(url=str(link), callback=self.parse_link)

    # callback function of Request
    def parse_link(self, response):
        b = open('data_parselink.csv', 'a')
        a = csv.writer(b, lineterminator='\n')
        a.writerow([response.url])
Upvotes: 1
Views: 1301
Reputation: 473803
Spider must return Request, BaseItem, dict or None, got 'generator'

getlink() is a generator. You are trying to yield it from the parse() generator. Instead, you can/should iterate over the results of the getlink() call:
def parse(self, response):
    browser = webdriver.Firefox()
    browser.maximize_window()
    browser.get(response.url)
    time.sleep(5)

    while True:
        try:
            for request in self.getlink(response):
                yield request

            browser.find_element_by_class_name('search-results-load-more').find_element_by_tag_name('a').click()
            time.sleep(3)
        except:
            break
Also, I've noticed you have both self.getlink(response) and self.getlink(browser). The latter is not going to work, since there is no xpath() method on a webdriver instance; you probably meant to make a Scrapy Selector out of the page source that your webdriver-controlled browser has loaded, for example:
selector = scrapy.Selector(text=browser.page_source)
self.getlink(selector)
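Putting the pieces together, parse() could look something like this rough sketch (untested; it keeps your class names and getlink() as-is, and rebuilds the selector after every click so the newly loaded listings are picked up):

def parse(self, response):
    browser = webdriver.Firefox()
    browser.maximize_window()
    browser.get(response.url)
    time.sleep(5)

    while True:
        # re-parse the current page source so the items added by "load more"
        # are included; Scrapy's default duplicate filter will drop requests
        # for URLs that were already scheduled on a previous iteration
        selector = Selector(text=browser.page_source)
        for request in self.getlink(selector):
            yield request

        try:
            browser.find_element_by_class_name('search-results-load-more').find_element_by_tag_name('a').click()
            time.sleep(3)
        except:
            break

    browser.quit()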
You should also take a look at Explicit Waits with Expected Conditions instead of using unreliable and slow artificial delays via time.sleep().
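For instance, something along these lines (a sketch; the 10-second timeout is an arbitrary choice) waits only as long as the element actually needs to appear:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the "load more" link to become clickable
wait = WebDriverWait(browser, 10)
load_more = wait.until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, '.search-results-load-more a')))
load_more.click()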
Plus, I'm not sure why you are writing to the CSV files manually instead of using the built-in Scrapy Items and Item Exporters. And you are not closing the files properly, nor using the with() context manager.
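For example, with a minimal item (the LinkItem name here is just a placeholder), the built-in CSV feed exporter can handle the file for you:

import scrapy

class LinkItem(scrapy.Item):
    # one scraped listing URL per item
    link = scrapy.Field()

# in getlink(), yield the item instead of writing the row manually:
#     yield LinkItem(link=link)
# then export from the command line:
#     scrapy crawl cwspider -o data_getlink.csv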
Additionally, try to catch more specific exception(s) and avoid having a bare try/except block.
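In this case the exception worth catching is NoSuchElementException, which Selenium raises when the "load more" button is gone (a sketch, reusing the browser and loop from above):

from selenium.common.exceptions import NoSuchElementException

while True:
    try:
        browser.find_element_by_class_name('search-results-load-more').find_element_by_tag_name('a').click()
        time.sleep(3)
    except NoSuchElementException:
        # the "load more" button is gone - no more pages to load
        break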
Upvotes: 4