Reputation: 19312
I tried the example Scrapy spider from the documentation page (the example titled "Return multiple Requests and items from a single callback"). I just changed the domains to point to a real website:
import scrapy

class MySpider(scrapy.Spider):
    name = 'huffingtonpost'
    allowed_domains = ['huffingtonpost.com/']
    start_urls = [
        'http://www.huffingtonpost.com/politics/',
        'http://www.huffingtonpost.com/entertainment/',
        'http://www.huffingtonpost.com/media/',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
But I'm getting a ValueError, as posted in this gist.
Any ideas?
Upvotes: 4
Views: 207
Reputation: 15090
Some of the extracted links are relative (for example, /news/hillary-clinton/), which is why scrapy.Request raises a ValueError. You should transform them into absolute URLs (e.g. http://www.huffingtonpost.com/news/hillary-clinton/) before yielding the request:
import scrapy

class MySpider(scrapy.Spider):
    name = 'huffingtonpost'
    # note: allowed_domains takes bare domains, not URLs (no trailing slash)
    allowed_domains = ['huffingtonpost.com']
    start_urls = [
        'http://www.huffingtonpost.com/politics/',
        'http://www.huffingtonpost.com/entertainment/',
        'http://www.huffingtonpost.com/media/',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}
        for url in response.xpath('//a/@href').extract():
            if url.startswith('/'):
                # transform the relative url into an absolute one
                url = 'http://www.huffingtonpost.com' + url
            if url.startswith('#'):
                # ignore hrefs that start with # (same-page fragments)
                continue
            yield scrapy.Request(url, callback=self.parse)
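As an aside, if your Scrapy version has response.urljoin() (added in Scrapy 1.0), you can let it resolve relative links against the page's own URL instead of hard-coding the domain. A minimal sketch of the same parse callback using it; the fragment check is still needed, since urljoin() would just turn a # link back into a request for the current page:

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}
        for url in response.xpath('//a/@href').extract():
            if url.startswith('#'):
                # skip same-page fragment links
                continue
            # urljoin() resolves relative hrefs against response.url
            yield scrapy.Request(response.urljoin(url), callback=self.parse)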
Upvotes: 4