Reputation: 20119
I have an XML page with the following structure:
<item>
<pubDate>Sat, 12 Dec 2015 16:35:00 GMT</pubDate>
<title>
some text
</title>
<link>
http://www.example.com/index.xml
</link>
...
And I would like to extract and follow the links inside the <link>
tags.
I only have the default code for this:
start_urls = ['example.com/example.xml']
rules = (
    Rule(LinkExtractor(allow="example.com"), callback='parse_item'),
)
But I don't know how to follow links that appear as element text. I've actually tried the LinkExtractor
tags='links'
option, but to no avail. The log shows the spider reaching the page and getting a 200 response, but it does not extract any links.
Upvotes: 3
Views: 805
Reputation: 473753
The key problem here is that this is not regular HTML input but an XML feed, and the links are inside the elements' text rather than in attributes. I think you just need the XMLFeedSpider
here:
import scrapy
from scrapy.spiders import XMLFeedSpider

class MySpider(XMLFeedSpider):
    name = 'myspider'
    start_urls = ['url_here']
    itertag = 'item'  # parse_node() is called once per <item> element

    def parse_node(self, response, node):
        # the URL is the text of the <link> node, so select text()
        # and strip the newlines that surround it in the feed
        for link in node.xpath('.//link/text()').extract():
            yield scrapy.Request(link.strip(), callback=self.parse_link)

    def parse_link(self, response):
        print(response.url)
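With XMLFeedSpider, parse_node() fires once for every node matched by itertag, so no link-extraction rules are needed at all; the .strip() matters because the text node includes the newlines around the URL. The spider can be run with scrapy runspider myspider.py, or scrapy crawl myspider from inside a project.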
Upvotes: 1
Reputation: 999
You should use the xml.etree library.
import xml.etree.ElementTree as ET

res = '''
<item>
<pubDate>Sat, 12 Dec 2015 16:35:00 GMT</pubDate>
<title>
some text
</title>
<link>
http://www.example.com/index.xml
</link>
</item>
'''

root = ET.fromstring(res)
results = root.findall('.//link')
for link in results:
    print(link.text.strip())  # strip the newlines around the URL text
The output will be as follows:
http://www.example.com/index.xml
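To actually follow the extracted links outside of Scrapy, a minimal sketch using the standard library's urllib could look like this (the feed URL is an assumption, and the feed is expected to be well-formed XML as above):
import urllib.request
import xml.etree.ElementTree as ET

feed = urllib.request.urlopen('http://www.example.com/example.xml').read()  # assumed feed URL
root = ET.fromstring(feed)
for link in root.findall('.//link'):
    url = link.text.strip()  # drop the newlines around the URL
    with urllib.request.urlopen(url) as resp:
        print(resp.status, url)  # e.g. "200 http://www.example.com/index.xml"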
Upvotes: 0