Parse RSS Feeds with variable XML structures in R

Question

I am an XML novice trying to scrape and parse the following RSS feed: http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml. Along the way, I ran into two questions:

1) I would like to extract the nodes of individual news stories using xmlChildren on the parsed document as follows:

library(RCurl)
library(XML)
xml.url <- "http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml"
script <- getURL(xml.url)
doc <- xmlParse(script)
doc.children = xpathApply(doc,"//entry",xmlChildren)

Although this procedure works well on other feeds, where the individual news releases are stored in <item> nodes, it does not work in this particular case with <entry> nodes: it returns an empty list. I am stuck here, as I cannot figure out what I am missing in the structure of the XML document.

2) More generally: can I implement this approach so that it handles both cases, where the individual news stories are stored either in <item> nodes or in <entry> nodes, without knowing the particular structure in advance?

Any help is very much appreciated, thank you.

hrbrmstr · Accepted Answer

You'll need to work with namespaces. Here are XML and xml2 options:

# XML
ns <- xmlNamespaceDefinitions(doc, simplify=TRUE)
names(ns)[1] <- "x"
nodes <- getNodeSet(doc, "//x:entry", namespaces=ns)

# xml2
library(xml2)

XML_URL <- "http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml"
doc <- read_xml(XML_URL)
ns <- xml_ns_rename(xml_ns(doc), d1="x")
xml_find_all(doc, "//x:entry", ns=ns)
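
To see why the un-prefixed "//entry" comes back empty, here is a minimal sketch using a tiny made-up Atom document instead of the live feed (the inline XML is an assumption for illustration, not the real feed content). The root element declares a default namespace, so every child element lives in that namespace and an XPath name without a prefix does not match it:

```r
library(xml2)

# Made-up, minimal Atom document; the real feed has more elements per entry
atom <- read_xml(
  '<feed xmlns="http://www.w3.org/2005/Atom">
     <entry><title>Story</title></entry>
   </feed>'
)

# xml2 assigns the prefix d1 to the default namespace
xml_ns(atom)

# No prefix: looks for <entry> in *no* namespace, finds nothing
plain <- xml_find_all(atom, "//entry")

# With the d1 prefix: matches the namespaced <entry>
prefixed <- xml_find_all(atom, "//d1:entry")

length(plain)
length(prefixed)
```

Renaming d1 to x, as above, is only cosmetic; the query works with either prefix as long as it maps to the Atom namespace URI.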

Look at using the boolean() XPath function to be able to handle multiple cases (i.e., the different feed formats).
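
As a rough sketch of how that could look with xml2: boolean() can test which node type a feed contains, and a local-name() predicate can select either one without caring about namespaces. This assumes the other feeds store stories in un-namespaced <item> nodes (typical RSS 2.0); find_stories and the two inline feeds are hypothetical names and data made up for illustration.

```r
library(xml2)

# Hypothetical helper: return story nodes for either feed flavour.
# local-name() compares the element name without its namespace, so it
# matches both RSS <item> and Atom <entry>.
find_stories <- function(doc) {
  xml_find_all(doc, "//*[local-name()='item' or local-name()='entry']")
}

# Made-up minimal feeds, one per format
atom <- read_xml(
  '<feed xmlns="http://www.w3.org/2005/Atom">
     <entry><title>A</title></entry>
   </feed>'
)
rss <- read_xml(
  '<rss version="2.0"><channel>
     <item><title>B</title></item>
   </channel></rss>'
)

# boolean() turns "does any such node exist?" into TRUE/FALSE
atom_has_entry <- xml_find_lgl(atom, "boolean(//*[local-name()='entry'])")
rss_has_entry  <- xml_find_lgl(rss,  "boolean(//*[local-name()='entry'])")

atom_stories <- find_stories(atom)
rss_stories  <- find_stories(rss)
```

The boolean() check is handy if you want to branch to format-specific parsing (Atom and RSS name their child elements differently); if you only need the story nodes, the single local-name() query is enough.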
