Reputation: 1026
I was wondering if anyone ever tried to extract/follow RSS item links using SgmlLinkExtractor/CrawlSpider. I can't get it to work...
I am using the following rule:
rules = ( Rule(SgmlLinkExtractor(tags=('link',), attrs=False), follow=True, callback='parse_article'), )
(having in mind that rss links are located in the link tag).
I am not sure how to tell SgmlLinkExtractor to extract the text() of the link and not to search the attributes ...
Any help is welcome, Thanks in advance
Upvotes: 7
Views: 5609
Reputation: 408
XML Example From scrapy doc XMLFeedSpider
from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem
class MySpider(XMLFeedSpider):
name = 'example.com'
allowed_domains = ['example.com']
start_urls = ['http://www.example.com/feed.xml']
iterator = 'iternodes' # This is actually unnecessary, since it's the default value
itertag = 'item'
def parse_node(self, response, node):
self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))
#item = TestItem()
item = {} # change to dict for removing the class not found error
item['id'] = node.xpath('@id').extract()
item['name'] = node.xpath('name').extract()
item['description'] = node.xpath('description').extract()
return item
Upvotes: -2
Reputation: 1026
I have done it using CrawlSpider:
class MySpider(CrawlSpider):
domain_name = "xml.example.com"
def parse(self, response):
xxs = XmlXPathSelector(response)
items = xxs.select('//channel/item')
for i in items:
urli = i.select('link/text()').extract()
request = Request(url=urli[0], callback=self.parse1)
yield request
def parse1(self, response):
hxs = HtmlXPathSelector(response)
# ...
yield(MyItem())
but I am not sure that is a very proper solution...
Upvotes: 0
Reputation: 1540
CrawlSpider rules don't work that way. You'll probably need to subclass BaseSpider and implement your own link extraction in your spider callback. For example:
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import XmlXPathSelector
class MySpider(BaseSpider):
name = 'myspider'
def parse(self, response):
xxs = XmlXPathSelector(response)
links = xxs.select("//link/text()").extract()
return [Request(x, callback=self.parse_link) for x in links]
You can also try the XPath in the shell, by running for example:
scrapy shell http://blog.scrapy.org/rss.xml
And then typing in the shell:
>>> xxs.select("//link/text()").extract()
[u'http://blog.scrapy.org',
u'http://blog.scrapy.org/new-bugfix-release-0101',
u'http://blog.scrapy.org/new-scrapy-blog-and-scrapy-010-release']
Upvotes: 7