Reputation: 931
I need to crawl an XML page, http://www.10why.net/sitemap.xml, which is just a table of the URLs that I want:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

thename = "sitemap"

class ReviewSpider(BaseSpider):
    name = thename
    allowed_domains = ['10why.net']
    start_urls = ['http://www.10why.net/sitemap.xml']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        content = hxs.select('//table[@cellpadding="5"]/tbody//a')
        print content
        for c in content:
            file = open('%s.txt' % thename, 'a')
            file.write("\n")
            file.write(c.extract())  # extract() to get a string, not a selector
            file.close()
The content printed is [] (an empty list). I used to be able to crawl things like this on a normal HTML page, but not on a sitemap XML page. Please help me. P.S.: I write the file myself for other reasons.
Upvotes: 0
Views: 2447
Reputation: 91580
I'm going to guess this is because you're looking at the HTML your browser is using to show the XML rather than the raw XML as it comes from the server. When I look at the given URL, I see an XML structure similar to:
<urlset>
  <url>
    <loc>http://www.10why.net/20130321/bb-nuan/</loc>
    <lastmod>2013-03-21T01:51:31+00:00</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.2</priority>
  </url>
</urlset>
You might want to use an XPath expression more like:
//urlset/url/loc
to get all the URLs in the sitemap.
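For reference, here is a minimal sketch of a spider built on the XmlXPathSelector that ships alongside BaseSpider (not tested against this site). One assumption to flag: real sitemap files usually declare the http://www.sitemaps.org/schemas/sitemap/0.9 namespace, in which case the unqualified //urlset/url/loc matches nothing; registering a prefix (the s below is an arbitrary choice) and qualifying the element names handles that:

from scrapy.spider import BaseSpider
from scrapy.selector import XmlXPathSelector

class SitemapUrlSpider(BaseSpider):
    name = "sitemap-urls"
    allowed_domains = ['10why.net']
    start_urls = ['http://www.10why.net/sitemap.xml']

    def parse(self, response):
        # Use the XML selector, not the HTML one, for an XML document.
        xxs = XmlXPathSelector(response)
        # Assumed: the sitemap declares the standard sitemap namespace.
        xxs.register_namespace('s', 'http://www.sitemaps.org/schemas/sitemap/0.9')
        # text() yields the URL strings themselves rather than <loc> nodes.
        for loc in xxs.select('//s:url/s:loc/text()').extract():
            print loc

If the file declares no namespace at all, the plain //urlset/url/loc/text() works as written.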
Upvotes: 2