Reputation: 97
I am an XML novice trying to scrape and parse the following RSS feed: http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml. Along the way, I ran into two questions:
1) I would like to extract the nodes of individual news stories using xmlChildren
on the parsed document as follows:
library(RCurl)
library(XML)
xml.url <- "http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml"
script <- getURL(xml.url)
doc <- xmlParse(script)
doc.children = xpathApply(doc,"//entry",xmlChildren)
Although this procedure works well on other feeds, where the individual news releases are stored in <item> nodes, it does not work in this particular case with <entry> nodes: it returns an empty list. I am stuck here, as I cannot figure out what I am missing in the structure of the XML document.
2) More generally: Can I implement this approach to handle both cases when the XML structure includes the individual news stories either in node <item>
or in node <entry>
without knowing the particular structure in advance?
Any help is very much appreciated, thank you.
Upvotes: 3
Views: 484
Reputation: 18500
This may not exactly answer your question, but did you consider using a ready-made package like tm.plugin.webmining?
If you do not want to use the package, you can still inspect the code and see how they parsed the data.
Upvotes: 1
Reputation: 78832
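You'll need to work with namespaces. The feed is Atom, and its elements sit in a default namespace, so the unprefixed XPath //entry matches nothing. Here are XML
and xml2
options: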
# XML
ns <- xmlNamespaceDefinitions(doc, simplify=TRUE)
names(ns)[1] <- "x"
nodes <- getNodeSet(doc, "//x:entry", namespaces=ns)
# xml2
library(xml2)
XML_URL <- "http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml"
doc <- read_xml(XML_URL)
ns <- xml_ns_rename(xml_ns(doc), d1="x")
xml_find_all(doc, "//x:entry", ns=ns)
Look at using the boolean() XPath function to test which element is present, so that you can handle multiple cases (i.e. the different feed formats).
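For the second question, here is a minimal sketch of one way to cover both formats with xml2, assuming the stories are always stored as either <item> or <entry>. The two inline feeds and the find_stories() helper are made up for illustration; they are not part of the original feed or of any package. Instead of registering a prefix, the XPath matches on local-name(), which ignores the namespace.

```r
library(xml2)

# Two tiny inline feeds (invented for illustration) so the sketch runs offline.
rss_txt <- '<rss version="2.0"><channel>
  <item><title>RSS story 1</title></item>
  <item><title>RSS story 2</title></item>
</channel></rss>'

atom_txt <- '<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>Atom story 1</title></entry>
</feed>'

rss_doc  <- read_xml(rss_txt)
atom_doc <- read_xml(atom_txt)

# Hypothetical helper: return the story nodes whether they are <item> or <entry>.
# local-name() compares the element name without its namespace, so the same
# expression matches un-namespaced RSS items and namespaced Atom entries.
find_stories <- function(doc) {
  xml_find_all(doc, "//*[local-name()='item' or local-name()='entry']")
}

rss_stories  <- find_stories(rss_doc)
atom_stories <- find_stories(atom_doc)

# Pull the title of each story, again without relying on a namespace prefix.
rss_titles  <- xml_text(xml_find_first(rss_stories,  "./*[local-name()='title']"))
atom_titles <- xml_text(xml_find_first(atom_stories, "./*[local-name()='title']"))

# boolean() reports whether a path matches anything, which is one way to
# detect the feed format before choosing a format-specific XPath.
rss_has_item  <- xml_find_lgl(rss_doc,  "boolean(//item)")
atom_has_item <- xml_find_lgl(atom_doc, "boolean(//item)")
```

The local-name() approach trades precision for convenience: it would also match an <entry> element from any other namespace, so on feeds that mix vocabularies the explicit prefix shown above is the safer choice.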
Upvotes: 2