Reputation: 23630
There are lots of questions on this, but I can't find a solution that suits this data format. I'd be grateful for advice on how to parse this:
<XML>
<constituency hansard_id="5" id="uk.org.publicwhip/cons/1" fromdate="1918" todate="9999-12-31">
<name text="Aberavon"/>
</constituency>
<constituency hansard_id="6" id="uk.org.publicwhip/cons/2" fromdate="1997-05-01" todate="2005-05-04">
<name text="Aberdeen Central"/>
</constituency>
<constituency hansard_id="7" id="uk.org.publicwhip/cons/3" fromdate="1885" todate="9999-12-31">
<name text="Aberdeen North"/>
</constituency>
</XML>
The desired fields are evidently c('hansard_id','id','fromdate','todate','name'). To read in and parse, I've tried the following:
require(XML)
> indata = htmlParse('data.xml', isHTML=F)
> class(indata)
[1] "XMLInternalDocument" "XMLAbstractDocument"
> print(indata)
<?xml version="1.0"?>
<XML>
<constituency hansard_id="5" id="uk.org.publicwhip/cons/1" fromdate="1918" todate="9999-12-31">
<name text="Aberavon"/>
</constituency>
<constituency hansard_id="6" id="uk.org.publicwhip/cons/2" fromdate="1997-05-01" todate="2005-05-04">
<name text="Aberdeen Central"/>
</constituency>
<constituency hansard_id="7" id="uk.org.publicwhip/cons/3" fromdate="1885" todate="9999-12-31">
<name text="Aberdeen North"/>
</constituency>
</XML>
> xmlToDataFrame(indata, stringsAsFactors=F)
name
1
2
3
It's reading in OK, but xmlToDataFrame can't handle the format. Is it because the data are attributes of the 'constituency' tag elements? Very grateful for any guidance.
Upvotes: 3
Views: 1496
Reputation: 30425
You are correct that xmlToDataFrame only accesses the XML nodes, not their attributes. For a given node, the xmlAttrs function returns that node's attributes. The xpathApply function takes a parsed XML document (say doc, which is indata in your code), uses an XPath expression to select a set of nodes, and then applies a user-defined function to each of them. The XPath "//*/constituency" will return all the constituency nodes in your document. We can then apply the xmlAttrs function to each:
res <- xpathApply(doc, "//*/constituency", xmlAttrs)
This returns a list with one set of attributes per node. We would like to bind these together; for example:
rbind.data.frame(res[[1]], res[[2]], ...)
would bind the first, second, third, ... sets of attributes together into a data.frame. A shorter way of doing this is to use the do.call function on our list of attributes:
do.call(rbind.data.frame, res)
will apply the row bind to all the elements of our list.
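Putting the steps together on the sample data from the question, a minimal end-to-end sketch might look like the following. Two things here are my own additions rather than part of the steps above: it row-binds with plain rbind and then calls as.data.frame (instead of rbind.data.frame), on the assumption that this keeps the attribute names as column names more predictably, and it pulls the name field from the text attribute of the child name node with xmlGetAttr. It also assumes every constituency carries the same attributes in the same order and has exactly one name child.

```r
library(XML)

# Sample document copied from the question; asText = TRUE parses the
# string directly instead of reading 'data.xml' from disk.
xml_text <- '<XML>
<constituency hansard_id="5" id="uk.org.publicwhip/cons/1" fromdate="1918" todate="9999-12-31">
<name text="Aberavon"/>
</constituency>
<constituency hansard_id="6" id="uk.org.publicwhip/cons/2" fromdate="1997-05-01" todate="2005-05-04">
<name text="Aberdeen Central"/>
</constituency>
<constituency hansard_id="7" id="uk.org.publicwhip/cons/3" fromdate="1885" todate="9999-12-31">
<name text="Aberdeen North"/>
</constituency>
</XML>'
doc <- xmlParse(xml_text, asText = TRUE)

# One named character vector of attributes per <constituency> node.
res <- xpathApply(doc, "//constituency", xmlAttrs)

# Row-bind the vectors into a character matrix, then convert it;
# the column names come from the attribute names.
out <- as.data.frame(do.call(rbind, res), stringsAsFactors = FALSE)

# Assumption: the constituency name sits in the "text" attribute of
# the single child <name> node.
out$name <- xpathSApply(doc, "//constituency/name", xmlGetAttr, "text")

print(out)
```

All columns come back as character, so fromdate and todate would still need converting if you want real dates (note that some values are bare years such as "1918").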
Upvotes: 2