user997943

Reputation: 323

R Fast XML Parsing

What is the fastest way to convert XML files to data frames in R currently?

The XML looks like this (note that not all rows have all fields):

  <row>
    <ID>001</ID>
    <age>50</age>
    <field3>blah</field3>
    <field4 />
  </row>
  <row>
    <ID>001</ID>
    <age>50</age>
    <field4 />
  </row>

I have tried two approaches:

  1. The xmlToDataFrame function from the XML package
  2. The speed-oriented xmlToDF function posted here

For an 8.5 MB file with 1.6k "rows" and 114 "columns", xmlToDataFrame took 25.1 seconds and xmlToDF took 16.7 seconds on my machine.

These times are quite large compared with Python XML parsers (e.g. xml.etree.ElementTree), which did the same job in 0.4 seconds.

Is there a faster way to do this in R, or is there something fundamental in R that prevents us from making this faster?

Some light on this would be really helpful!

Upvotes: 7

Views: 4660

Answers (2)

martin

Reputation: 123

Just in case it helps someone, I found this solution using data.table to be even faster in my use case, as it only converts the data to a data.table once it has finished looping over the rows:

library(XML)
library(data.table)

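# filename is the path to the XML file
doc <- xmlParse(filename)
# "//Data" matches the record nodes in my file; adjust the XPath to your XML
d <- getNodeSet(doc, "//Data")

# Build one named list per node, then bind them all in a single call.
# fill = TRUE pads fields that are missing from a node with NA.
dt <- rbindlist(lapply(d, function(node) {
    as.list(getChildrenStrings(node))
}), fill = TRUE)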
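
For the kind of XML shown in the question, where some rows lack fields, a minimal sketch of the same idea might look like this. The inline string and its <rows> wrapper are made up for illustration (the question does not show the root element), and "//row" is assumed to be the right XPath for that layout.

library(XML)
library(data.table)

# Hypothetical inline sample modelled on the question's XML;
# the <rows> root element is an assumption.
xml_text <- "<rows>
  <row><ID>001</ID><age>50</age><field3>blah</field3><field4 /></row>
  <row><ID>002</ID><age>51</age><field4 /></row>
</rows>"

doc <- xmlParse(xml_text, asText = TRUE)
rows <- getNodeSet(doc, "//row")

# One named list per row; fill = TRUE pads fields a row lacks with NA.
dt <- rbindlist(lapply(rows, function(node) {
    as.list(getChildrenStrings(node))
}), fill = TRUE)

print(dt)

All columns come back as character here, so a type.convert pass is still needed if you want numeric columns.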

Upvotes: 5

Randy Lai

Reputation: 3184

Updated for the comments

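# doc is the parsed document, e.g. doc = xmlParse(filename)
d = xmlRoot(doc)
size = xmlSize(d)

# First pass: collect the union of field names across all rows
names = NULL
for(i in 1:size){
    v = getChildrenStrings(d[[i]])
    names = unique(c(names, names(v)))
}

# Write the header line (this also starts a.csv afresh)
cat(paste(names, collapse=","), "\n", sep="", file="a.csv")

# Second pass: append one comma-separated line per row; missing fields become NA
for(i in 1:size){
    v = getChildrenStrings(d[[i]])
    cat(paste(v[names], collapse=","), "\n", sep="", file="a.csv", append=TRUE)
}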

This finishes in about 0.4 seconds for an XML document with 1000 records of 100 fields each. If you know the variable names in advance, you can even omit the first for loop.

Note: if your XML content contains commas or quotation marks, you may have to take special care with them. In that case, I recommend the next method.
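
If you would rather stay with the CSV route, one possible way to cope with commas and quotation marks is to collect the rows in a character matrix and let write.table do the quoting. This is only a sketch in base R: the rec list is a hypothetical stand-in for the per-row getChildrenStrings results, and the output goes to a temporary file.

# Hypothetical stand-ins for the named character vectors that
# getChildrenStrings(d[[i]]) would return for each row.
rec <- list(
    c(ID = "001", age = "50", field3 = "blah, with a comma"),
    c(ID = "002", age = "51", field4 = "say \"hi\"")
)

# Union of field names, in order of first appearance.
fields <- unique(unlist(lapply(rec, names)))

# Character matrix with one row per record; absent fields stay NA.
m <- matrix(NA_character_, nrow = length(rec), ncol = length(fields),
            dimnames = list(NULL, fields))
for (i in seq_along(rec)) m[i, names(rec[[i]])] <- rec[[i]]

# write.table quotes character fields and, with qmethod = "double",
# doubles embedded quotation marks; NA is written as an empty field.
out <- tempfile(fileext = ".csv")
write.table(m, out, sep = ",", row.names = FALSE, na = "",
            qmethod = "double")

# Read everything back as character to check the round trip.
back <- read.csv(out, colClasses = "character", na.strings = "")

Writing the file once at the end also avoids reopening it for every row, at the cost of holding the whole matrix in memory.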


If you want to construct the data.frame dynamically, you can do this with data.table. It is a little slower than the CSV method above, but faster than data.frame.

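# Preallocate a size x length(names) table of NAs, one column per field
m = data.table(matrix(NA, ncol=length(names), nrow=size))
setnames(m, names)
# Make every column character so string values can be assigned into it
for (n in names) mode(m[[n]]) = "character"
# Fill row i by reference; parentheses on the left of := replace the deprecated with=FALSE
for(i in 1:size){
    v = getChildrenStrings(d[[i]])
    m[i, (names(v)) := as.list(v)]
}
# Convert each column from character to its natural type
for (n in names) m[, (n) := type.convert(m[[n]], as.is=TRUE)]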

It finishes in about 1.1 seconds for the same document.
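
A possible variant of the fill loop uses data.table's set(), which assigns by reference with less per-call overhead than the [ method. This is an untimed sketch, not a measured improvement; the rec list is again a hypothetical stand-in for the getChildrenStrings results.

library(data.table)

# Hypothetical stand-ins for the per-row getChildrenStrings(d[[i]]) results.
rec <- list(
    c(ID = "001", age = "50", field3 = "blah"),
    c(ID = "002", age = "51")
)
fields <- unique(unlist(lapply(rec, names)))

# Preallocate an all-NA character table, one column per field.
m <- as.data.table(matrix(NA_character_, nrow = length(rec),
                          ncol = length(fields)))
setnames(m, fields)

# Fill row i in place; only the fields present in that row are touched.
for (i in seq_along(rec)) {
    v <- rec[[i]]
    set(m, i = i, j = names(v), value = as.list(v))
}

# Replace each column with its type-converted version.
for (n in fields) set(m, j = n, value = type.convert(m[[n]], as.is = TRUE))

Note that type.convert turns an ID such as "001" into the integer 1, so skip the conversion for columns where leading zeros matter.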

Upvotes: 4
