geotheory

Reputation: 23630

Parsing XML to data.frame in R

There are lots of questions on this, but I can't find a solution that suits this data format. I'd be grateful for advice on how to parse this:

<XML>
<constituency hansard_id="5" id="uk.org.publicwhip/cons/1" fromdate="1918" todate="9999-12-31">
    <name text="Aberavon"/>
</constituency>
<constituency hansard_id="6" id="uk.org.publicwhip/cons/2" fromdate="1997-05-01" todate="2005-05-04">
    <name text="Aberdeen Central"/>
</constituency>
<constituency hansard_id="7" id="uk.org.publicwhip/cons/3" fromdate="1885" todate="9999-12-31">
    <name text="Aberdeen North"/>
</constituency>
</XML>

The desired fields are evidently c('hansard_id','id','fromdate','todate','name'). To read in and parse I've tried the following:

> require(XML)
> indata = htmlParse('data.xml', isHTML=F)
> class(indata)
[1] "XMLInternalDocument" "XMLAbstractDocument"
> print(indata)
<?xml version="1.0"?>
<XML>
  <constituency hansard_id="5" id="uk.org.publicwhip/cons/1" fromdate="1918" todate="9999-12-31">
    <name text="Aberavon"/>
  </constituency>
  <constituency hansard_id="6" id="uk.org.publicwhip/cons/2" fromdate="1997-05-01" todate="2005-05-04">
    <name text="Aberdeen Central"/>
  </constituency>
  <constituency hansard_id="7" id="uk.org.publicwhip/cons/3" fromdate="1885" todate="9999-12-31">
    <name text="Aberdeen North"/>
  </constituency>
</XML>

> xmlToDataFrame(indata, stringsAsFactors=F)
  name
1     
2     
3     

It's reading in ok, but xmlToDataFrame can't handle the format. Is it because the data are attributes to the 'constituency' tag elements? Very grateful for any guidance.

Upvotes: 3

Views: 1496

Answers (1)

jdharrison

Reputation: 30425

You are correct that xmlToDataFrame only reads the XML nodes and their text content, not their attributes. For a given node, the xmlAttrs function returns that node's attributes. The xpathApply function takes a parsed XML document (call it doc; it is indata in your code), evaluates an XPath expression against it to get a set of nodes, and then applies a user-defined function to each of those nodes. The XPath "//*/constituency" returns all the constituency nodes in your document. We can then apply the xmlAttrs function to each:

res <- xpathApply(doc, "//*/constituency", xmlAttrs)

This returns a list with one set of attributes per node. We would like to bind these together, for example:

rbind.data.frame(res[[1]], res[[2]], ...)

would bind the first, second, third, ... sets of attributes together into a data.frame. A short way of doing this is to use the do.call function on our list of attributes:

do.call(rbind.data.frame, res)

will apply the row bind to all the elements of our list.
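
Note that the attributes on constituency do not include the name you asked for, since that lives in the text attribute of the child name node. Below is a minimal sketch of one way to pick it up as well. It assumes every constituency has exactly one name child (otherwise the column lengths will not line up), it inlines your sample as a string so that it runs on its own, and the names xml_text and df are just placeholders. It also uses a plain rbind followed by as.data.frame instead of rbind.data.frame, so that the column names are taken from the attribute names.

```r
library(XML)

# Inline copy of the sample from the question, so the sketch is self-contained
xml_text <- '<XML>
<constituency hansard_id="5" id="uk.org.publicwhip/cons/1" fromdate="1918" todate="9999-12-31">
    <name text="Aberavon"/>
</constituency>
<constituency hansard_id="6" id="uk.org.publicwhip/cons/2" fromdate="1997-05-01" todate="2005-05-04">
    <name text="Aberdeen Central"/>
</constituency>
<constituency hansard_id="7" id="uk.org.publicwhip/cons/3" fromdate="1885" todate="9999-12-31">
    <name text="Aberdeen North"/>
</constituency>
</XML>'

doc <- xmlParse(xml_text, asText = TRUE)

# One named character vector of attributes per <constituency> node
res <- xpathApply(doc, "//*/constituency", xmlAttrs)

# Row-bind the vectors into a character matrix, then convert to a data.frame
df <- as.data.frame(do.call(rbind, res), stringsAsFactors = FALSE)

# The name is stored in the "text" attribute of the child <name> node
# (assumes exactly one <name> per <constituency>)
df$name <- xpathSApply(doc, "//*/constituency/name", xmlGetAttr, "text")

str(df)
```

All columns come back as character, so fromdate and todate would still need converting if you want real dates (some of them are year-only, like "1918").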

Upvotes: 2
