Reputation: 81
I'm trying to write my first spider with Scrapy, using Dmoz as a test site. I get this error message: TypeError: Request url must be str or unicode, got NoneType. But in the debug output I can see the correct URL.
Code:
import scrapy
import urlparse
class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = ["http://www.dmoz.org/search?q=france&all=no&t=regional&cat=all"]
    def parse(self, response):
        sites = response.css('#site-list-content > div.site-item > div.title-and-desc')
        for site in sites:
            yield {
                'name': site.css('a > div.site-title::text').extract_first().strip(),
                'url': site.xpath('a/@href').extract_first().strip(),
                'description': site.css('div.site-descr::text').extract_first().strip(),
            }
        nxt = response.css('#subcategories-div > div.previous-next > div.next-page')
        next_page = nxt.css('a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)
Logs:
2016-10-18 11:17:03 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/search?q=france&start=20&type=next&all=no&t=regional&cat=all> (referer: http://www.dmoz.org/search?q=france&all=no&t=regional&cat=all)
2016-10-18 11:17:03 [scrapy] ERROR: Spider error processing <GET http://www.dmoz.org/search?q=france&start=20&type=next&all=no&t=regional&cat=all> (referer: http://www.dmoz.org/search?q=france&all=no&t=regional&cat=all)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/ENV/bin/tutorial/dirbot/spiders/dmoz.py", line 25, in parse
    yield scrapy.Request(next_page, callback=self.parse)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 51, in _set_url
    raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got NoneType:
2016-10-18 11:17:03 [scrapy] INFO: Closing spider (finished)
2016-10-18 11:17:03 [scrapy] INFO: Stored json feed (20 items) in: test.json
2016-10-18 11:17:03 [scrapy] INFO: Dumping Scrapy stats:
Upvotes: 2
Views: 6847
Reputation: 3691
The error is in your code here:
    if next_page is not None:
        next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)
As Padraic Cunningham mentions in his comment, you yield the Request regardless of whether next_page is None or holds a URL.
You can solve your problem by changing your code to this:
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)
Here the yield sits inside the if block, so a Request is only created when a next-page link was actually found.
By the way, you can shorten the if to the following:
    if next_page:
because None (like an empty string) is falsy under Python's truth-value testing.
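To illustrate the truthiness point, here is a small sketch that runs without Scrapy. The helper name should_follow and the sample values are made up for the illustration; the assumption is that extract_first() gives None when nothing matches and could give an empty string for an empty href.

```python
# Hypothetical values that the next-page selector might hand back.
candidates = [None, "", "/search?q=france&start=20"]


def should_follow(next_page):
    """Return True only when next_page is a non-empty string."""
    # bool() applies the same truth-value test as a bare `if next_page:`.
    return bool(next_page)


# One flag per candidate, in the same order as the list above.
results = [should_follow(candidate) for candidate in candidates]

# For comparison: `is not None` would still let the empty string through.
not_none = [candidate is not None for candidate in candidates]
```

The only value where the two checks disagree is the empty string, which `if next_page:` skips and `if next_page is not None:` lets through.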
Since your spider stops working, try debugging it in scrapy shell, where you can check whether your CSS queries return values. You can also add an else to the if block above that logs or prints a message when no next_page was found, so you know that something is wrong with the site or with your CSS queries.
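A rough sketch of the if / else idea, written as a standalone Python 3 function so it runs without Scrapy. The function name next_page_url, the logger name and the sample href are assumptions made for the example; urljoin from the standard library stands in for response.urljoin. In the real spider the if branch would yield scrapy.Request(...) and the else branch could call self.logger.warning(...).

```python
import logging
from urllib.parse import urljoin

logger = logging.getLogger("dmoz_sketch")  # hypothetical logger name


def next_page_url(page_url, next_href):
    """Return the absolute URL of the next page, or None if there is none.

    page_url  -- URL of the page being parsed (response.url in the spider)
    next_href -- href taken from the "next" link; may be None or empty
    """
    if next_href:
        # Only build a URL when the selector actually matched something.
        return urljoin(page_url, next_href)
    else:
        # No link: either this is the last page or the CSS query is wrong.
        logger.warning("No next_page found on %s", page_url)
        return None


base = "http://www.dmoz.org/search?q=france&all=no&t=regional&cat=all"
# Hypothetical href, modelled on the URL that shows up in the question's log.
href = "/search?q=france&start=20&type=next&all=no&t=regional&cat=all"

found = next_page_url(base, href)
missing = next_page_url(base, None)
```

With this shape, a missing link produces a warning in the log instead of a TypeError, which makes it easier to tell a finished crawl from a broken selector.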
Upvotes: 2