Reputation: 33
I'm trying to scrape multiple pages from an API to practice and develop my XML scraping. One issue that has arisen is that when I try to scrape a document formatted like this: https://i.sstatic.net/epd7t.png and store it as XML, it fails to do so.
In CMD, the spider fetches the URL and creates the XML file on my computer, but the file is empty.
How would I fix it to output the whole document, or even parts of it?
My code is below:
from scrapy.spider import BaseSpider
from scrapy.selector import XmlXPathSelector
from doitapi.items import DoIt
import random

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["do-it.org.uk"]
    start_urls = []
    number = []
    for count in range(100):
        number.append(random.randint(2000000, 2500000))

    for i in number:
        start_urls.append("http://www.do-it.org.uk/syndication/opportunities/%d?apiKey=XXXXX-XXXX-XXX-XXX-XXXXX" % i)

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        titles = xxs.register_namespace("d", "http://www.do-it.org.uk/volunteering-opportunity")
        items = []
        for titles in titles:
            item = DoIt()
            item["url"] = response.url
            item["name"] = titles.select("//d:title").extract()
            item["description"] = titles.select("//d:description").extract()
            item["username"] = titles.select("//d:info-provider/name").extract()
            item["location"] = titles.select("//d:info-provider/address").extract()
            items.append(item)
        return items
Upvotes: 3
Views: 5571
Reputation: 20748
Your XML file uses the namespace "http://www.do-it.org.uk/volunteering-opportunity", so to select title, name, etc. you have 2 choices:
- either call xxs.remove_namespaces() once and then use .select("./title"), .select("./description"), etc.
- or register the namespace once with a prefix, e.g. xxs.register_namespace("doit", "http://www.do-it.org.uk/volunteering-opportunity"), and then use .select("./doit:title"), .select("./doit:description"), etc.

For more details on XML namespaces, see this page in the FAQ and this page in the docs
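To see why the unprefixed names match nothing, here is a minimal sketch of the same two choices. It deliberately uses Python's standard-library `xml.etree.ElementTree` instead of Scrapy so that it runs without any install; it is not the Scrapy API. The sample document is an assumption (the real feed is only shown as an image in the question), and `strip_namespaces` is a hypothetical helper that plays the role of `remove_namespaces()`.

```python
import xml.etree.ElementTree as ET

NS = "http://www.do-it.org.uk/volunteering-opportunity"

# Hypothetical sample document: only the namespace URI and the element names
# "title" and "description" come from the question; the rest is made up.
SAMPLE = (
    '<opportunity xmlns="%s">'
    "<title>Garden helper</title>"
    "<description>Help maintain a community garden.</description>"
    "</opportunity>" % NS
)


def strip_namespaces(root):
    """Hypothetical helper: drop the '{uri}' part of every tag, in place."""
    for element in root.iter():
        if isinstance(element.tag, str) and element.tag.startswith("{"):
            element.tag = element.tag.split("}", 1)[1]
    return root


root = ET.fromstring(SAMPLE)

# A plain name finds nothing: the element's real tag is '{NS}title'.
unqualified = root.find("title")

# Choice: bind a prefix to the namespace URI and use it in the path.
qualified_title = root.find("doit:title", {"doit": NS}).text

# Choice: remove the namespaces once, then use plain names.
plain_root = strip_namespaces(ET.fromstring(SAMPLE))
stripped_title = plain_root.find("title").text
stripped_description = plain_root.find("description").text

print(unqualified, qualified_title, stripped_title)
```

The prefix itself ("doit" here, "d" in the question) is arbitrary; what has to match is the namespace URI it is bound to.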
Upvotes: 4