Eagle

Reputation: 81

Scrapy Request url must be str or unicode, got NoneType:

I am trying to create my first spider with Scrapy, using DMOZ as a test. I get this error: TypeError: Request url must be str or unicode, got NoneType. But in the debug output the crawled URL looks correct.

Code:

import scrapy
import urlparse


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = ["http://www.dmoz.org/search?q=france&all=no&t=regional&cat=all"]

    def parse(self, response):
        sites = response.css('#site-list-content > div.site-item > div.title-and-desc')
        
        for site in sites:
            yield {
                'name': site.css('a > div.site-title::text').extract_first().strip(),
                'url': site.xpath('a/@href').extract_first().strip(),
                'description': site.css('div.site-descr::text').extract_first().strip(),
            }

        nxt = response.css('#subcategories-div > div.previous-next > div.next-page')
        next_page = nxt.css('a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)          

        yield scrapy.Request(next_page, callback=self.parse)

Logs:

2016-10-18 11:17:03 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/search?q=france&start=20&type=next&all=no&t=regional&cat=all> (referer: http://www.dmoz.org/search?q=france&all=no&t=regional&cat=all)
2016-10-18 11:17:03 [scrapy] ERROR: Spider error processing <GET http://www.dmoz.org/search?q=france&start=20&type=next&all=no&t=regional&cat=all> (referer: http://www.dmoz.org/search?q=france&all=no&t=regional&cat=all)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/ENV/bin/tutorial/dirbot/spiders/dmoz.py", line 25, in parse
    yield scrapy.Request(next_page, callback=self.parse)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 51, in _set_url
    raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got NoneType:
2016-10-18 11:17:03 [scrapy] INFO: Closing spider (finished)
2016-10-18 11:17:03 [scrapy] INFO: Stored json feed (20 items) in: test.json
2016-10-18 11:17:03 [scrapy] INFO: Dumping Scrapy stats:

Upvotes: 2

Views: 6847

Answers (1)

GHajba

Reputation: 3691

The error is in your code here:

if next_page is not None:
    next_page = response.urljoin(next_page)          

yield scrapy.Request(next_page, callback=self.parse)

As Padraic Cunningham mentions in his comment: you yield the Request regardless of whether next_page is None or contains a URL.

You can solve your problem by changing your code to this:

if next_page is not None:
    next_page = response.urljoin(next_page)          
    yield scrapy.Request(next_page, callback=self.parse)

which moves the yield inside the if block, so a Request is only created when a next page actually exists.

By the way, you can shorten your if to the following:

if next_page:

because of Python's truthiness rules: None (and the empty string) both evaluate to False.
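To illustrate the truthiness point, here is a minimal standalone sketch (the helper name and the sample href values are hypothetical, just for demonstration):

```python
# A truthy check like "if next_page:" guards against both None
# (no link matched) and "" (an empty href), not just None.
def should_follow(next_page):
    # bool(None) and bool("") are both False; a non-empty string is True.
    return bool(next_page)

print(should_follow(None))                         # no next-page link found
print(should_follow(""))                           # selector matched an empty href
print(should_follow("/search?q=france&start=40"))  # a real relative URL
```

The first two cases are skipped and only the non-empty URL would be followed.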

And since your spider stops working, try debugging your application with scrapy shell, where you can check whether your CSS queries return values or not. You can also add an else branch to the previous if block that logs or prints a message saying no next_page was found, so you know something is wrong with the site or with your CSS queries.
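The log-on-miss pattern can be sketched without Scrapy as a plain helper (the function name is hypothetical; response_urljoin stands in for response.urljoin, and the warning text is just an example):

```python
import logging

logger = logging.getLogger("dmoz")

def next_request_url(response_urljoin, next_page):
    """Return an absolute URL to follow, or None after logging why not.

    next_page is the value of extract_first(), which may be None when
    the CSS query matched nothing.
    """
    if next_page:
        return response_urljoin(next_page)
    # The else case: make the silent failure visible in the crawl log.
    logger.warning("No next-page link found - check the CSS query or the site layout")
    return None
```

In a real spider you would use self.logger (Scrapy's per-spider logger) for the same effect.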

Upvotes: 2
