Cautious Brick

Reputation: 79

Scrapy Request 'IndentationError: unexpected indent' on parse callback

I'm using the Scrapy CLI on an Ubuntu 18 server. I'm trying to avoid hardcoding a bunch of URLs in the start_urls property by instead yielding a scrapy.Request() at the bottom of my parse method. The website I'm scraping is fairly basic and has a separate page for each of the years 2014-2030. At the bottom of my code I have an if statement that checks the current year and moves the scraper to the next year's page. I'm new to Scrapy in general, so I'm not sure I'm calling scrapy.Request() correctly. Here is my code:

import scrapy

from .. import items

class EventSpider(scrapy.Spider):
    name = "event_spider"
    start_urls = [
        "http://www.seasky.org/astronomy/astronomy-calendar-2014.html",
    ]
    user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
    start_year = 2014
    
    #response is the website
    def parse(self, response):
        CONTENT_SELECTOR = 'div#right-column-content ul li'
        
        for astro_event in response.css(CONTENT_SELECTOR):
            NAME_SELECTOR = "p span.title-text ::text"
            DATE_SELECTOR = "p span.date-text ::text"
            DESCRIPTION_SELECTOR = "p ::text"
            
            item = items.AstroEventsItem()
            
            item["title"] = astro_event.css(NAME_SELECTOR).extract_first()
            item["date"] = astro_event.css(DATE_SELECTOR).extract_first()
            item["description"] = astro_event.css(DESCRIPTION_SELECTOR)[-1].extract()
            
            yield item
               
        #Next page code:
        #Goes through years 2014 to 2030
        if(self.start_year < 2030):
            self.start_year = self.start_year + 1
            new_url = "http://www.seasky.org/astronomy/astronomy-calendar-" + str(self.start_year) + ".html"
            print(new_url)
            yield scrapy.Request(new_url, callback = self.parse)

Here is the error I'm receiving after it successfully scrapes the first page:

2020-11-10 05:25:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.seasky.org/astronomy/astronomy-calendar-2015.html> (referer: http://www.seasky.org/astronomy/astronomy-calendar-2014.html)
2020-11-10 05:25:50 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.seasky.org/astronomy/astronomy-calendar-2015.html> (referer: http://www.seasky.org/astronomy/astronomy-calendar-2014.html)
Traceback (most recent call last):
  File "/home/jcmq6b/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: <200 http://www.seasky.org/astronomy/astronomy-calendar-2015.html>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
    result = f(*args, **kw)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 58, in process_spider_input
    return scrape_func(response, request, spider)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/core/scraper.py", line 149, in call_spider
    warn_on_generator_with_return_value(spider, callback)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/misc.py", line 245, in warn_on_generator_with_return_value
    if is_generator_with_return_value(callable):
  File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/misc.py", line 230, in is_generator_with_return_value
    tree = ast.parse(dedent(inspect.getsource(callable)))
  File "/usr/lib/python3.6/ast.py", line 35, in parse
    return compile(source, filename, mode, PyCF_ONLY_AST)
  File "<unknown>", line 1
    def parse(self, response):
    ^
IndentationError: unexpected indent

I think I'm maybe not passing the correct parameters in order to callback the parse method, but I'm not sure. Any help is much appreciated! Let me know if I need to post more information.

Upvotes: 4

Views: 814

Answers (2)

Georgiy

Reputation: 3561

The code line that raised this error, tree = ast.parse(dedent(inspect.getsource(callable))) at File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/misc.py", line 230, in is_generator_with_return_value, was removed as a result of Scrapy pull request 4935, related to Scrapy issue 4477, previously mentioned in the comments.

To prevent this error, it is recommended to update Scrapy to a newer version, at least 2.5.0.

This is mentioned in the Scrapy 2.5.0 release notes: https://docs.scrapy.org/en/2.11/news.html?highlight=indentation#id40
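For context, the IndentationError is not raised by the spider code itself: it comes from Scrapy's introspection, which feeds the result of inspect.getsource() through ast.parse(). When the dedent step fails to strip the leading whitespace of a method defined inside a class, ast.parse() receives indented source and raises exactly the error shown in the traceback. A minimal stdlib-only reproduction of that failure mode (the source string is a hypothetical stand-in for what Scrapy extracts):

```python
import ast

# Source of a method as it appears inside a class body, still indented.
# ast.parse() expects module-level source, so the leading spaces are illegal.
src = '    def parse(self, response):\n        yield {}\n'

try:
    ast.parse(src)
except IndentationError as exc:
    print(type(exc).__name__, '-', exc.msg)
    # prints: IndentationError - unexpected indent
```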

Update

If, for some reason, updating the Scrapy version is not an option, it is possible to disable the warn_on_generator_with_return_value(spider, callback) check seen in the traceback:

  File "/usr/local/lib/python3.6/dist-packages/scrapy/core/scraper.py", line 149, in call_spider
    warn_on_generator_with_return_value(spider, callback)

by monkey patching warn_on_generator_with_return_value itself, adding something like this to the spider code:

import scrapy.utils.misc
import scrapy.core.scraper

def warn_on_generator_with_return_value_stub(spider, callable):
    pass

scrapy.utils.misc.warn_on_generator_with_return_value = warn_on_generator_with_return_value_stub
scrapy.core.scraper.warn_on_generator_with_return_value = warn_on_generator_with_return_value_stub

as mentioned in this answer and later in another answer.

Upvotes: 3

Cautious Brick

Reputation: 79

To anyone who stumbles on this: I didn't find the reason for the indentation error, but I did find a workaround by separating my code into two different parse methods:

import scrapy

from .. import items

class EventSpider(scrapy.Spider):
    name = "event_spider"
    start_urls = ["http://www.seasky.org/astronomy/astronomy-calendar-2014.html"]
    user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
    start_year = 2014
    
    #Next page code:
    def parse(self, response):
        #Goes through years 2014 to 2030 from the href links at top of page
        for href in response.css("div#top-links div h3 a::attr(href)"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_contents)

    # parses items for database
    def parse_contents(self, response):
        
        CONTENT_SELECTOR = 'div#right-column-content ul li'
        
        for astro_event in response.css(CONTENT_SELECTOR):
            NAME_SELECTOR = "p span.title-text ::text"
            DATE_SELECTOR = "p span.date-text ::text"
            DESCRIPTION_SELECTOR = "p ::text"
            
            item = items.AstroEventsItem()
            
            item["title"] = astro_event.css(NAME_SELECTOR).extract_first()
            item["date"] = astro_event.css(DATE_SELECTOR).extract_first()
            item["description"] = astro_event.css(DESCRIPTION_SELECTOR)[-1].extract()
            
            yield item

The first parse method gets the URLs from the hrefs listed on the site. For each href it then calls the second parse method, parse_contents, which processes the information scraped from the page into items for MongoDB. Hoping this might help someone out if they have a similar issue.
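For readers unfamiliar with URL joining: response.urljoin() in the first parse method resolves each extracted href against the URL of the current page, following the same rules as the standard library's urllib.parse.urljoin. A small illustration using the calendar URLs from this question:

```python
from urllib.parse import urljoin

# Scrapy's response.urljoin(href) behaves like urljoin(response.url, href).
base = "http://www.seasky.org/astronomy/astronomy-calendar-2014.html"
print(urljoin(base, "astronomy-calendar-2015.html"))
# prints: http://www.seasky.org/astronomy/astronomy-calendar-2015.html
```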

Upvotes: 0
