Dervin Thunk

Reputation: 20119

Extracting links from xml using scrapy

I have an xml page with the following structure:

<item>
  <pubDate>Sat, 12 Dec 2015 16:35:00 GMT</pubDate>
  <title>
   some text
  </title>
  <link>
     http://www.example.com/index.xml
  </link>
  ...

I would like to extract and follow the links inside the <link> tags.

I only have the default code for this:

start_urls = ['example.com/example.xml']

rules = (
    Rule(LinkExtractor(allow="example.com"),
          callback='parse_item',),
)

But I don't know how to follow links that appear as element text. I've tried the LinkExtractor tags='links' option, but to no avail. The log shows the spider fetching the page and getting a 200 response, but it extracts no links.

Upvotes: 3

Views: 805

Answers (2)

alecxe

Reputation: 473753

The key problem here is that this is not regular HTML input but an XML feed, and the links are in the elements' text, not in attributes. I think you just need XMLFeedSpider here:

import scrapy
from scrapy.spiders import XMLFeedSpider

class MySpider(XMLFeedSpider):
    name = 'myspider'
    start_urls = ['url_here']

    # iterate over every <item> node in the feed
    itertag = "item"

    def parse_node(self, response, node):
        # the URL is the text content of <link>, padded with whitespace
        for link in node.xpath(".//link/text()").extract():
            yield scrapy.Request(link.strip(), callback=self.parse_link)

    def parse_link(self, response):
        print(response.url)
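For what it's worth, the reason a LinkExtractor with tags='link' finds nothing here is that LinkExtractor reads URLs from attributes (href by default), while in this feed the URL is the element's text. A quick standard-library check on the snippet from the question illustrates this:

```python
import xml.etree.ElementTree as ET

# the <item> snippet from the question
item = ET.fromstring("""<item>
  <pubDate>Sat, 12 Dec 2015 16:35:00 GMT</pubDate>
  <title>
   some text
  </title>
  <link>
     http://www.example.com/index.xml
  </link>
</item>""")

link = item.find("link")
print(link.attrib)         # {} -- no href (or any) attribute to extract
print(link.text.strip())   # the URL lives in the text, padded with whitespace
```

That whitespace padding is also why the spider above calls .strip() before building the Request.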

Upvotes: 1

user565447

Reputation: 999

You can use the xml.etree module from the standard library.

import xml.etree.ElementTree as ET

res = '''
<item>
  <pubDate>Sat, 12 Dec 2015 16:35:00 GMT</pubDate>
  <title>
   some text
  </title>
  <link>
     http://www.example.com/index.xml
  </link>
</item>
'''

root = ET.fromstring(res)
results = root.findall('.//link')
for link in results:
    print(link.text.strip())

The output will be as follows:

http://www.example.com/index.xml
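The same approach scales to a feed with several <item> entries; note the .strip(), since each text node keeps the surrounding whitespace. A small sketch with made-up URLs:

```python
import xml.etree.ElementTree as ET

feed = """<rss><channel>
<item><title>a</title><link>
   http://www.example.com/a.xml
</link></item>
<item><title>b</title><link>
   http://www.example.com/b.xml
</link></item>
</channel></rss>"""

root = ET.fromstring(feed)
# collect every <link> URL, stripping the surrounding whitespace
links = [link.text.strip() for link in root.findall(".//link")]
print(links)
```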

Upvotes: 0
