sulav_lfc

Reputation: 782

python return multiple times

I'm using Scrapy to scrape data from this site. I need to call getlink from parse. A normal call is not working, and when I use yield, I get this error:

2015-11-16 10:12:34 [scrapy] ERROR: Spider must return Request, BaseItem, dict or None, got 'generator' in <GET https://www.coldwellbankerhomes.com/fl/miami-dade-county/kvc-17_1,17_3,17_2,17_8/incl-22/>

Returning the getlink function from parse works, but I need to execute some code even after returning. I'm confused; any help would be really appreciated.

    # -*- coding: utf-8 -*-

    from scrapy.spiders import BaseSpider
    from scrapy.selector import Selector
    from scrapy.http import Request,Response
    import re
    import csv
    import time

    from selenium import webdriver



    class ColdWellSpider(BaseSpider):
        name = "cwspider"
        allowed_domains = ["coldwellbankerhomes.com"]
        #start_urls = [''.join(row).strip() for row in csv.reader(open("remaining_links.csv"))]
        #start_urls = ['https://www.coldwellbankerhomes.com/fl/boynton-beach/5451-verona-drive-unit-d/pid_9266204/']
        start_urls = ['https://www.coldwellbankerhomes.com/fl/miami-dade-county/kvc-17_1,17_3,17_2,17_8/incl-22/']

        def parse(self,response):

                #browser = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--load-images=false'])
                browser = webdriver.Firefox()
                browser.maximize_window()
                browser.get(response.url)
                time.sleep(5)

                #to extract all the links from a page and send request to those links
                #this works but even after returning i need to execute the while loop
                return self.getlink(response)

                #for clicking the load more button in the page 
                while True:
                    try:
                        browser.find_element_by_class_name('search-results-load-more').find_element_by_tag_name('a').click()
                        time.sleep(3)
                        self.getlink(response)

                    except:
                        break

        def getlink(self,response):
            print 'hhelo'

            c = open('data_getlink.csv', 'a')
            d = csv.writer(c, lineterminator='\n')
            print 'hello2'
            listclass = response.xpath('//div[@class="list-items"]/div[contains(@id,"snapshot")]')

            for l in listclass:
                   link = 'http://www.coldwellbankerhomes.com/'+''.join(l.xpath('./h2/a/@href').extract())

                   d.writerow([link])
                   yield Request(url = str(link),callback=self.parse_link)


        #callback function of Request
        def parse_link(self,response):
            b = open('data_parselink.csv', 'a')
            a = csv.writer(b, lineterminator='\n')
            a.writerow([response.url])

Upvotes: 1

Views: 1301

Answers (1)

alecxe

Reputation: 473803

Spider must return Request, BaseItem, dict or None, got 'generator'

getlink() is a generator. You are trying to yield it from the parse() generator.
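In plain Python terms, yielding a generator hands back the generator object itself as a single item, whereas iterating over it hands back the values inside it (a standalone illustration, not specific to Scrapy):

    def inner():
        yield 1
        yield 2

    def wrong():
        # yields a single item: the generator object returned by inner()
        yield inner()

    def right():
        # yields 1, then 2
        for item in inner():
            yield item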

Instead, you can/should iterate over the results of the getlink() call:

    def parse(self, response):
        browser = webdriver.Firefox()
        browser.maximize_window()
        browser.get(response.url)
        time.sleep(5)

        while True:
            try:
                for request in self.getlink(response):
                    yield request

                browser.find_element_by_class_name('search-results-load-more').find_element_by_tag_name('a').click()
                time.sleep(3)
            except:
                break

Also, I've noticed you have both self.getlink(response) and self.getlink(browser). The latter is not going to work, since there is no xpath() method on a webdriver instance - you probably meant to make a Scrapy Selector out of the page source that your webdriver-controlled browser has loaded, for example:

    selector = scrapy.Selector(text=browser.page_source)
    self.getlink(selector)
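Putting the two points together, a rough sketch of what parse() could look like (this assumes import scrapy at the top of the file; getlink() then receives the Selector, whose xpath() method works the same way as on the original response):

    def parse(self, response):
        browser = webdriver.Firefox()
        browser.maximize_window()
        browser.get(response.url)
        time.sleep(5)

        while True:
            # re-parse whatever the browser is currently showing,
            # so links revealed by "load more" are picked up too
            selector = scrapy.Selector(text=browser.page_source)
            for request in self.getlink(selector):
                yield request

            try:
                browser.find_element_by_class_name('search-results-load-more').find_element_by_tag_name('a').click()
                time.sleep(3)
            except:
                # see the note below about catching a specific exception here
                break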

You should also take a look at Explicit Waits with Expected Conditions instead of relying on unreliable and slow artificial delays via time.sleep().
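For instance, the sleep after clicking could be replaced with something along these lines (a sketch; the 10-second timeout and the CSS selector are only illustrative):

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # block only until the "load more" link is actually clickable,
    # up to a maximum of 10 seconds
    load_more = WebDriverWait(browser, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, '.search-results-load-more a'))
    )
    load_more.click()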

Plus, I'm not sure why you are writing to CSV manually instead of using built-in Scrapy Items and Item Exporters. Also, you are not closing the files properly and not using the with context manager either.
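As a rough sketch of the Items approach (PropertyLinkItem is a made-up name for illustration), parse_link() could yield an item and let Scrapy's feed exports produce the CSV:

    import scrapy

    class PropertyLinkItem(scrapy.Item):
        url = scrapy.Field()

    # inside the spider:
    def parse_link(self, response):
        # run with:  scrapy crawl cwspider -o data_parselink.csv
        yield PropertyLinkItem(url=response.url)

And if some manual file writing stays around, opening the file with "with open('data_parselink.csv', 'a') as f:" at least guarantees it gets closed.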

Additionally, try to catch more specific exception(s) and avoid having a bare try/except block.
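For example, the only failure the load-more loop really needs to treat as "stop paginating" is the link being missing, which Selenium reports as NoSuchElementException (a sketch; depending on the page you may need to handle other exceptions from selenium.common.exceptions as well):

    from selenium.common.exceptions import NoSuchElementException

    while True:
        try:
            browser.find_element_by_class_name('search-results-load-more').find_element_by_tag_name('a').click()
            time.sleep(3)
        except NoSuchElementException:
            # no "load more" link left on the page, nothing more to load
            break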

Upvotes: 4
