user2927435

Reputation: 33

Using Scrapy for XML page

I'm trying to scrape multiple pages from an API to practice and improve my XML scraping. One issue that has come up: when I scrape a document formatted like this: https://i.sstatic.net/epd7t.png and try to store it as XML, nothing gets stored.

In CMD, Scrapy fetches the URL and creates the XML file on my computer, but the file is empty.

How would I fix it to echo out the whole document or even parts of it?

I put the code below:

from scrapy.spider import BaseSpider
from scrapy.selector import XmlXPathSelector
from doitapi.items import DoIt
import random

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["do-it.org.uk"]
    start_urls = []
    number = []
    for count in range(100):
        number.append(random.randint(2000000,2500000))


    for i in number:
        start_urls.append("http://www.do-it.org.uk/syndication/opportunities/%d?apiKey=XXXXX-XXXX-XXX-XXX-XXXXX" %i)



    def parse(self, response):
        xxs = XmlXPathSelector(response)
        titles = xxs.register_namespace("d", "http://www.do-it.org.uk/volunteering-opportunity")
        items = []
        for titles in titles:
            item = DoIt()
            item ["url"] = response.url
            item ["name"] = titles.select("//d:title").extract()
            item ["description"] = titles.select("//d:description").extract()
            item ["username"] = titles.select("//d:info-provider/name").extract()
            item ["location"] = titles.select("//d:info-provider/address").extract()
            items.append(item)
        return items

Upvotes: 3

Views: 5571

Answers (1)

paul trmbrth

Reputation: 20748

Your XML document uses the namespace "http://www.do-it.org.uk/volunteering-opportunity", so to select title, name, etc. you have two choices:

  • either use xxs.remove_namespaces() once and then use .select("./title"), .select("./description") etc.
  • or register the namespace once, with a prefix like "doit", xxs.register_namespace("doit", "http://www.do-it.org.uk/volunteering-opportunity"), and then use .select("./doit:title"), .select("./doit:description") etc.
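To see both choices in action without running a spider, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors. The sample XML and its root element name are made up for illustration, since the real feed's layout is only shown as an image. The point carries over to XPath in Scrapy: an unprefixed step such as name only matches elements in no namespace, so when child elements inherit the default namespace, every step of the path needs the prefix.

```python
import xml.etree.ElementTree as ET

NS = "http://www.do-it.org.uk/volunteering-opportunity"

# Hypothetical sample document; the element layout is an assumption.
SAMPLE = f"""<opportunity xmlns="{NS}">
  <title>Dog walker</title>
  <description>Walk dogs at a local shelter.</description>
  <info-provider>
    <name>Example Shelter</name>
  </info-provider>
</opportunity>"""

root = ET.fromstring(SAMPLE)

# An unprefixed path looks for <title> in no namespace, so nothing matches.
no_prefix = root.findall("title")

# Choice 2: bind a prefix to the namespace URI and use it on every step.
prefixes = {"doit": NS}
title = root.findtext("doit:title", namespaces=prefixes)
provider = root.findtext("doit:info-provider/doit:name", namespaces=prefixes)

# Choice 1 (rough stand-in for remove_namespaces): strip the "{uri}" part
# from every tag, after which plain paths work.
for el in root.iter():
    el.tag = el.tag.split("}", 1)[-1]
title_plain = root.findtext("title")

print(no_prefix, title, provider, title_plain)
```

Stripping namespaces is the simpler route when the document uses only one namespace; registering a prefix is safer when several namespaces could define elements with the same local name.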

For more details on XML namespaces, see this page in the FAQ and this page in the docs

Upvotes: 4
