OMGPOP

Reputation: 931

use scrapy to crawl an xml webpage

I need to crawl an XML page, http://www.10why.net/sitemap.xml, which is just a table of the URLs that I want.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import re

thename = "sitemap"

class ReviewSpider(BaseSpider):
    name = thename
    allowed_domains = ['10why.net']
    start_urls = ['http://www.10why.net/sitemap.xml']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        content = hxs.select('//table[@cellpadding="5"]/tbody//a')

        print content
        for c in content:
            file = open('%s.txt' % thename, 'a')
            file.write("\n")
            file.write(c)
            file.close()

The printed content is [] (an empty list). I used to be able to crawl things on normal HTML pages, but not on this sitemap XML page. Please help me. PS: I write the file myself for other reasons.

Upvotes: 0

Views: 2447

Answers (1)

Mike Christensen

Reputation: 91580

I'm going to guess this is because you're looking at the HTML your browser is using to show the XML rather than the raw XML as it comes from the server. When I look at the given URL, I see an XML structure similar to:

<urlset>
   <url>
      <loc>http://www.10why.net/20130321/bb-nuan/</loc>
      <lastmod>2013-03-21T01:51:31+00:00</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.2</priority>
   </url>
</urlset>

You might want to use an XPath expression more like:

//urlset/url/loc

to get all of the URLs in the site map.
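As a quick illustration of what that XPath matches (using the standard library's `xml.etree.ElementTree` here rather than Scrapy's selectors, and the sample XML above rather than the live sitemap), something like this pulls out the `<loc>` text:

```python
import xml.etree.ElementTree as ET

# Sample sitemap XML, as shown above
sitemap = """
<urlset>
   <url>
      <loc>http://www.10why.net/20130321/bb-nuan/</loc>
      <lastmod>2013-03-21T01:51:31+00:00</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.2</priority>
   </url>
</urlset>
"""

root = ET.fromstring(sitemap)
# Equivalent of //urlset/url/loc: every <loc> under a <url> child of the root
urls = [loc.text for loc in root.findall('./url/loc')]
print(urls)
```

In your spider you'd do the equivalent with Scrapy's XML selector instead of `HtmlXPathSelector`, but the XPath itself is the important part. (Note that real sitemap files often declare the `http://www.sitemaps.org/schemas/sitemap/0.9` namespace, in which case the elements must be matched with that namespace.)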

Upvotes: 2
