loremIpsum1771

Reputation: 2527

urlopen error when scrapy spider is run from a script

I've written a script that runs a Scrapy spider located in a different directory. The script takes in user input, parses it, and adds it to a URL to be scraped. The script seemed to be working earlier, but now I'm getting the following error:

URLError: <urlopen error [Errno 101] Network is unreachable>
ERROR: Unable to read instance data, giving up

The spider works properly when run with the scrapy crawl command, but for some reason it isn't working when run from the script. Here is the function that runs the spider from the script (located within the spider file):

def spiderCrawl(bandname):
    aSpider = MySpider3()
    aSpider.create_link(bandname)
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()
    d = runner.crawl(aSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()
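For comparison, the pattern in the Scrapy docs for running a spider from a script passes the spider class (not an instance) to crawl(). A minimal sketch, with the imports the snippet above relies on:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
d = runner.crawl(MySpider3)  # pass the spider class; keyword args would go to __init__
d.addBoth(lambda _: reactor.stop())
reactor.run()  # blocks here until the crawl finishes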

function that creates the url:

def create_link(self, bandname):
        tc_url = "https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"  
        start_urls = [tc_url]

Also, below is an image of the terminal with the error message. The fact that a random bandname was entered suggests that the URL wasn't even read in the first place. What could be the problem here? Any help would be appreciated, thanks.

[screenshot of the terminal showing the URLError output]

Update:

So it seems the problem was that my create_link method inside the spider class wasn't actually adding the link to the start_urls list. The script does run the spider when I use the raw_input statement inside the spider file instead of the script. What would be the proper way to pass the user's input to the spider file so it can be added as a link? I have the code for the spider and the script that runs it below to make the post more complete:
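For reference, Scrapy's documented way to hand a value like this to a spider is a spider argument: keyword arguments passed to CrawlerRunner.crawl() (or on the command line via scrapy crawl -a bandname=...) are forwarded to the spider's constructor. A minimal sketch, assuming a bandname keyword argument:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.spiders import CrawlSpider

class MySpider3(CrawlSpider):
    name = 'comparator'
    allowed_domains = ["www.ticketcity.com"]

    def __init__(self, bandname=None, *args, **kwargs):
        super(MySpider3, self).__init__(*args, **kwargs)
        # Build start_urls before the crawl begins, so start_requests() sees it.
        self.start_urls = ["https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"]

def spiderCrawl(bandname):
    runner = CrawlerRunner()
    d = runner.crawl(MySpider3, bandname=bandname)  # keyword args reach __init__
    d.addBoth(lambda _: reactor.stop())
    reactor.run()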

script code

from ticket_city_scraper.ticket_city_scraper import *
from ticket_city_scraper.ticket_city_scraper.spiders import tc_spider 

bandname = raw_input("Enter bandname\n") # I took out this line and added it to the spider file to make the script work
tc_spider.spiderCrawl(bandname)

spider file

class MySpider3(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.ticketcity.com"]
    start_urls = [tc_url]
    tickets_list_xpath = './/div[@class = "vevent"]'

    def create_link(self, bandname):
        tc_url = "https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"
        self.start_urls = [tc_url]
        #return tc_url

    def parse_json(self, response):
        loader = response.meta['loader']
        jsonresponse = json.loads(response.body_as_unicode())
        ticket_info = jsonresponse.get('B')
        price_list = [i.get('P') for i in ticket_info]
        if len(price_list) > 0:
            str_Price = str(price_list[0])
            ticketPrice = unicode(str_Price, "utf-8")
            loader.add_value('ticketPrice', ticketPrice)
        else:
            ticketPrice = unicode("sold out", "utf-8")
            loader.add_value('ticketPrice', ticketPrice)
        return loader.load_item()

    def parse_price(self, response):
        print "parse price function entered \n"
        loader = response.meta['loader']
        event_City = response.xpath('.//span[@itemprop="addressLocality"]/text()').extract()
        eventCity = ''.join(event_City)
        loader.add_value('eventCity', eventCity)
        event_State = response.xpath('.//span[@itemprop="addressRegion"]/text()').extract()
        eventState = ''.join(event_State)
        loader.add_value('eventState', eventState)
        event_Date = response.xpath('.//span[@class="event_datetime"]/text()').extract()
        eventDate = ''.join(event_Date)
        loader.add_value('eventDate', eventDate)
        ticketsLink = loader.get_output_value("ticketsLink")
        json_id_list = re.findall(r"(\d+)[^-]*$", ticketsLink)
        json_id = "".join(json_id_list)
        json_url = "https://www.ticketcity.com/Catalog/public/v1/events/" + json_id + "/ticketblocks?P=0,99999999&q=0&per_page=250&page=1&sort=p.asc&f.t=s&_=1436642392938"
        yield scrapy.Request(json_url, meta={'loader': loader}, callback=self.parse_json, dont_filter=True)

    def parse(self, response):
        selector = HtmlXPathSelector(response)
        # iterate over tickets
        for ticket in selector.select(self.tickets_list_xpath):
            loader = XPathItemLoader(ComparatorItem(), selector=ticket)
            # define loader
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            # iterate over fields and add xpaths to the loader
            loader.add_xpath('eventName', './/span[@class="summary listingEventName"]/text()')
            loader.add_xpath('eventLocation', './/div[@class="divVenue location"]/text()')
            loader.add_xpath('ticketsLink', './/a[@class="divEventDetails url"]/@href')
            #loader.add_xpath('eventDateTime', '//div[@id="divEventDate"]/@title') #datetime type
            #loader.add_xpath('eventTime', './/*[@class = "productionsTime"]/text()')

            print "Here is ticket link \n" + loader.get_output_value("ticketsLink")
            #sel.xpath("//span[@id='PractitionerDetails1_Label4']/text()").extract()
            ticketsURL = "https://www.ticketcity.com/" + loader.get_output_value("ticketsLink")
            ticketsURL = urljoin(response.url, ticketsURL)
            yield scrapy.Request(ticketsURL, meta={'loader': loader}, callback=self.parse_price, dont_filter=True)

def spiderCrawl(bandname):
    # process = CrawlerProcess({
    #     'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    # })
    # process.crawl(aSpider)
    # process.start()
    aSpider = MySpider3()
    #aSpider.create_link(bandname)
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()
    d = runner.crawl(aSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()

Upvotes: 1

Views: 445

Answers (1)

starrify

Reputation: 14731

I can only guess, since you didn't provide an MCVE. However, I'd say that in your function create_link, this line:

start_urls = [tc_url]

should really be:

self.start_urls = [tc_url]
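Without self., the assignment just creates a local variable that is thrown away when create_link returns, so the spider's start_urls is never updated. In context, the corrected method would be (a sketch; note that create_link still has to be called before the crawl starts, or start_requests() will never see the new list):

def create_link(self, bandname):
    # Bind to the instance attribute, not a throwaway local name.
    tc_url = "https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"
    self.start_urls = [tc_url]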

Upvotes: 1
