Nico21

Reputation: 97

Parse RSS Feeds with variable XML structures in R

I am an XML novice trying to scrape and parse the following RSS feed: http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml. Along the way, I ran into two questions:

1) I would like to extract the nodes of individual news stories using xmlChildren on the parsed document as follows:

library(RCurl)
library(XML)
xml.url <- "http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml"
script <- getURL(xml.url)
doc <- xmlParse(script)
doc.children = xpathApply(doc,"//entry",xmlChildren)

Although this procedure works well on other feeds, where the individual news releases are stored in <item> nodes, it does not work in this particular case with <entry> nodes: it returns an empty list. I am stuck here, as I cannot figure out what I am missing in the structure of the XML document.

2) More generally: can I extend this approach to handle both cases, where the individual news stories sit in either <item> or <entry> nodes, without knowing the particular structure in advance?

Any help is very much appreciated, thank you.

Upvotes: 3

Views: 484

Answers (2)

Karsten W.

Reputation: 18500

This may not exactly answer your question, but have you considered using a ready-made package like tm.plugin.webmining?

If you do not want to use the package, you can still inspect the code and see how they parsed the data.

Upvotes: 1

hrbrmstr

Reputation: 78832

You'll need to work with namespaces. Here are XML and xml2 options:

# XML
ns <- xmlNamespaceDefinitions(doc, simplify=TRUE)
names(ns)[1] <- "x"
nodes <- getNodeSet(doc, "//x:entry", namespaces=ns)

# xml2
library(xml2)

XML_URL <- "http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml"
doc <- read_xml(XML_URL)
ns <- xml_ns_rename(xml_ns(doc), d1="x")
xml_find_all(doc, "//x:entry", ns=ns)
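To see why the plain "//entry" query comes back empty, here is a minimal offline sketch. It assumes the feed declares the Atom default namespace, and it uses a small inline document (made up for illustration) as a stand-in for the real feed. xml2 assigns the prefix d1 to the first default namespace it finds, which is why the rename above starts from d1.

```r
library(xml2)

# Inline stand-in for the feed (assumption: the real feed uses the
# Atom default namespace on its root element).
atom_demo <- read_xml(
  '<feed xmlns="http://www.w3.org/2005/Atom">
     <entry><title>A</title></entry>
   </feed>'
)

# Shows the prefix xml2 assigned to the default namespace (d1).
xml_ns(atom_demo)

# An unprefixed name only matches elements in no namespace,
# so this finds nothing.
n_plain <- length(xml_find_all(atom_demo, "//entry"))

# Using the auto-assigned prefix finds the entry.
n_prefixed <- length(xml_find_all(atom_demo, "//d1:entry"))
```

In XPath 1.0 an unprefixed element name never picks up the document's default namespace, so the prefix (whether d1 or a renamed x) is needed on every step of the path.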

Look at the boolean() XPath function to handle multiple cases (i.e., the different feed formats).
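As one possible way to cover both formats with a single query, the sketch below matches on local-name(), which ignores namespaces, rather than on boolean(). The helper name find_stories and the two inline documents are hypothetical and only serve the illustration.

```r
library(xml2)

# Hypothetical helper: return the story nodes of an RSS or Atom feed.
# local-name() compares the element name without its namespace, so the
# same expression matches RSS <item> (no namespace) and Atom <entry>
# (default Atom namespace).
find_stories <- function(doc) {
  xml_find_all(doc, "//*[local-name()='item' or local-name()='entry']")
}

# Inline stand-ins for the two feed types (made up for this example).
atom_feed <- read_xml(
  '<feed xmlns="http://www.w3.org/2005/Atom">
     <entry><title>A</title></entry>
     <entry><title>B</title></entry>
   </feed>'
)
rss_feed <- read_xml(
  '<rss version="2.0"><channel>
     <item><title>C</title></item>
   </channel></rss>'
)

n_atom <- length(find_stories(atom_feed))
n_rss  <- length(find_stories(rss_feed))
```

The trade-off is that local-name() matches any element called item or entry regardless of namespace, which is usually acceptable for feeds but less strict than a namespace-aware query.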

Upvotes: 2
