Reputation: 89
The following code parses XML and extracts information such as node, parent, and type into a data frame. It works fine for a small XML file, but a file of more than 25,000 lines takes a couple of minutes to process, so I would like to optimize the code to run faster. The function should be able to read any XML file and generate the data frame shown below.
Sample XML:
<?xml version="1.0" encoding="UTF-8"?>
<CATALOG>
  <PLANT id="1" required="false">
    <COMMON Source="NLM">Bloodroot</COMMON>
    <BOTANICAL>Aquilegia canadensis</BOTANICAL>
    <DATE>
      <Year>2013</Year>
    </DATE>
  </PLANT>
  <PLANT id="2" required="true">
    <COMMON Source="LNP">Columbine</COMMON>
    <BOTANICAL>Aquilegia canadensis</BOTANICAL>
    <DATE>
      <Year>2014</Year>
    </DATE>
  </PLANT>
</CATALOG>
Output:
    path                     node       value                 parent   type
1   CATALOG                  CATALOG    NULL                  NULL     element
2   CATALOG/PLANT            PLANT      NULL                  CATALOG  element
3   CATALOG/PLANT            id         1                     PLANT    attribute
4   CATALOG/PLANT            required   false                 PLANT    attribute
5   CATALOG/PLANT/COMMON     COMMON     Bloodroot             PLANT    text
6   CATALOG/PLANT/COMMON     Source     NLM                   COMMON   attribute
7   CATALOG/PLANT/BOTANICAL  BOTANICAL  Aquilegia canadensis  PLANT    text
8   CATALOG/PLANT/DATE       DATE       NULL                  PLANT    element
9   CATALOG/PLANT/DATE/Year  Year       2013                  DATE     text
10  CATALOG/PLANT            PLANT      NULL                  CATALOG  element
11  CATALOG/PLANT            id         2                     PLANT    attribute
12  CATALOG/PLANT            required   true                  PLANT    attribute
13  CATALOG/PLANT/COMMON     COMMON     Columbine             PLANT    text
14  CATALOG/PLANT/COMMON     Source     LNP                   COMMON   attribute
15  CATALOG/PLANT/BOTANICAL  BOTANICAL  Aquilegia canadensis  PLANT    text
16  CATALOG/PLANT/DATE       DATE       NULL                  PLANT    element
17  CATALOG/PLANT/DATE/Year  Year       2014                  DATE     text
Code Snippet:
library(XML)
library(plyr)

## helper function of xPathApply
getValues <- function(x) {
  List <- list()
  # find all ancestors of a given node
  ancestorNames <- character()
  ancestorNamesList <- xmlAncestors(x, fun = function(y) {
    ancestorNames <- c(ancestorNames, xmlName(y))})
  pathName <- paste(ancestorNamesList, collapse = "/")
  # find the parent of a given node
  parentNode <- xmlParent(x)
  parentName <- "NULL"
  if(!is.null(parentNode)) {
    parentName <- xmlName(parentNode)
  }
  if(inherits(x, "XMLInternalElementNode")) {
    # check if the value of the given node exists i.e. text
    if(length(xmlValue(x, recursive=FALSE)) != 0) {
      List <- append(List, list(path = pathName, node = xmlName(x), value = xmlValue(x, recursive=FALSE), parent = parentName, type = "text"))
    } else {
      List <- append(List, list(path = pathName, node = xmlName(x), value = "NULL", parent = parentName, type = "element"))
    }
  }
  ## attributes
  if(!is.null(xmlAttrs(x))) {
    num.attributes = xmlSize(xmlAttrs(x))
    for (i in seq_len(num.attributes)) {
      # get the attribute name
      attributeName <- names(xmlAttrs(x)[i])
      # get the attribute value
      attributeValue <- xmlAttrs(x)[[i]]
      List <- append(List, list(path = pathName, node = attributeName, value = attributeValue, parent = parentName, type = "attribute"))
    }
  }
  return(List)
}

## recursive function
visitNode <- function(node, xpath) {
  if (is.null(node)) {
    return()
  }
  # number of children of a node
  num.children <- xmlSize(node)
  bypass <- function(n = num.children) {
    if(num.children == 0) {
      xpathSApply(node, path = xpath, getValues)
    } else {
      return(num.children)
    }
  }
  # recursive call to visitNode
  for (i in seq_len(num.children)) {
    visitNode(node[[i]], xpath)
  }
  # add list type result to data frame
  if(is.list(result <- bypass())) {
    dt <<- do.call(rbind.fill, lapply(result, data.frame))
  }
}

# read XML data from the given file
xtree <- xmlParse("test.xml")
# retrieve the root of the XML
root <- xmlRoot(xtree)
# define data frame which is to hold the data interpreted from XML
dt <- data.frame(path = NA, node = NA, value = NA, parent = NA, type = NA)
# call to recursive function
visitNode(root, xpath <- "//node()")
dt
Upvotes: 2
Views: 521
Reputation: 206197
I really wish there was good XSLT support in R, but I can't seem to find a great package for it. A different strategy would be to transform the XML into a simpler data file that you can easily read with read.table or something else. You can parse it pretty easily with xmlEventParse. Here's a custom handler which seems to create the data you want:
getHandler <- function(file = "", sep = ",") {
  list(
    # write the header row once, at the start of the document
    .startDocument = function(.state) {
      cat("path", "node", "value", "parent", "type", file = file, sep = sep)
      cat("\n", file = file, sep = sep, append = TRUE)
      .state
    },
    # push the element onto the path, then write one row for the element
    # and one row per attribute
    .startElement = function(name, atts, .state) {
      .state$path <- c(.state$path, name)
      cat(paste(.state$path, collapse = "/"), name, NA,
          .state$path[length(.state$path) - 1], "element",
          sep = sep, file = file, append = TRUE)
      cat("\n", file = file, append = TRUE)
      if (!is.null(atts)) {
        cat(paste(paste(.state$path, collapse = "/"), names(atts), atts,
                  .state$path[length(.state$path) - 1], "attribute",
                  sep = sep, collapse = "\n"),
            file = file, append = TRUE)
        cat("\n", file = file, append = TRUE)
      }
      .state
    },
    # pop the element off the path
    .endElement = function(name, .state) {
      .state$path <- .state$path[-length(.state$path)]
      .state
    },
    # write a row for any non-blank text
    .text = function(value, .state) {
      value <- gsub("^\\s+|\\s+$", "", value)
      if (nchar(value) > 0) {
        cat(paste(.state$path, collapse = "/"), .state$path[length(.state$path)],
            value, .state$path[length(.state$path) - 1], "text",
            sep = sep, file = file, append = TRUE)
        cat("\n", file = file, append = TRUE)
      }
      .state
    }
  )
}
So it's not exactly pretty, but it's basically just building a string with cat(). We can then use it with
zz <- xmlEventParse("test.xml",
    handlers = getHandler(),
    state = list(path=character(0)), useDotNames=TRUE)
This will output the data as comma-separated values to the screen. To save to a file, you can do
zz <- xmlEventParse("test.xml",
    handlers = getHandler(file="ok.txt", sep="\t"),
    state = list(path=character(0)), useDotNames=TRUE)
which will write the data as tab-delimited text to a file named "ok.txt". You can then read the data in with
read.table("ok.txt", sep="\t", header=T)
which returns
    path                     node       value                 parent   type
1   CATALOG                  CATALOG    <NA>                           element
2   CATALOG/PLANT            PLANT      <NA>                  CATALOG  element
3   CATALOG/PLANT            id         1                     CATALOG  attribute
4   CATALOG/PLANT            required   false                 CATALOG  attribute
5   CATALOG/PLANT/COMMON     COMMON     <NA>                  PLANT    element
6   CATALOG/PLANT/COMMON     Source     NLM                   PLANT    attribute
7   CATALOG/PLANT/COMMON     COMMON     Bloodroot             PLANT    text
8   CATALOG/PLANT/BOTANICAL  BOTANICAL  <NA>                  PLANT    element
9   CATALOG/PLANT/BOTANICAL  BOTANICAL  Aquilegia canadensis  PLANT    text
10  CATALOG/PLANT/DATE       DATE       <NA>                  PLANT    element
11  CATALOG/PLANT/DATE/Year  Year       <NA>                  DATE     element
12  CATALOG/PLANT/DATE/Year  Year       2013                  DATE     text
13  CATALOG/PLANT            PLANT      <NA>                  CATALOG  element
14  CATALOG/PLANT            id         2                     CATALOG  attribute
15  CATALOG/PLANT            required   true                  CATALOG  attribute
16  CATALOG/PLANT/COMMON     COMMON     <NA>                  PLANT    element
17  CATALOG/PLANT/COMMON     Source     LNP                   PLANT    attribute
18  CATALOG/PLANT/COMMON     COMMON     Columbine             PLANT    text
19  CATALOG/PLANT/BOTANICAL  BOTANICAL  <NA>                  PLANT    element
20  CATALOG/PLANT/BOTANICAL  BOTANICAL  Aquilegia canadensis  PLANT    text
21  CATALOG/PLANT/DATE       DATE       <NA>                  PLANT    element
22  CATALOG/PLANT/DATE/Year  Year       <NA>                  DATE     element
23  CATALOG/PLANT/DATE/Year  Year       2014                  DATE     text
Now there are more rows than you had in your sample, but some of the selection rules weren't that clear to me.
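One thing to watch when reading the file back: by default read.table treats quote characters and "#" specially, so text values that contain an apostrophe or a hash can throw off the column count. A minimal sketch of a more defensive read, using a made-up two-row file in the same tab-delimited layout (the temporary file and its contents are assumptions for illustration, not output from the handler):

```r
# Hypothetical sample in the same tab-delimited layout; the apostrophe and "#"
# are there on purpose to show why the extra arguments matter.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "path\tnode\tvalue\tparent\ttype",
  "CATALOG/PLANT\tPLANT\tNA\tCATALOG\telement",
  "CATALOG/PLANT/COMMON\tCOMMON\tDutchman's #1 pick\tPLANT\ttext"
), tmp)

parsed <- read.table(
  tmp,
  sep = "\t",
  header = TRUE,
  quote = "",               # treat ' and " as ordinary characters
  comment.char = "",        # treat # as an ordinary character
  stringsAsFactors = FALSE  # keep columns as character, not factor
)
str(parsed)
```

If your XML text can itself contain tabs or newlines, a delimited file is probably not a safe intermediate format and you would need to escape those characters in the handler first.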
The main idea is that xmlEventParse is more efficient than xmlParse because it doesn't have to load the entire tree. Additionally, by using cat() to dump to a file, I don't have to worry about memory management right away (but it's not exactly like writing to disk is all that great either).
Anyway, it's at least another option to consider.
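If the round trip through a file bothers you, the same event handlers could collect the rows in memory and build the data frame once at the end. This is only a rough sketch, not something I have timed on a large file: makeCollector() and its result() accessor are hypothetical names, it keeps the path in a closure instead of the state argument, and it follows the same parent rules as the handler above (with NA for the root's parent).

```r
library(XML)

# Hypothetical helper: returns SAX handlers plus an accessor for the result.
makeCollector <- function() {
  rows <- list()        # one named character vector per output row
  path <- character(0)  # stack of currently open element names

  # name of the element enclosing the current one, or NA at the root
  parentOf <- function() {
    if (length(path) > 1L) path[length(path) - 1L] else NA_character_
  }
  addRow <- function(node, value, parent, type) {
    rows[[length(rows) + 1L]] <<- c(
      path = paste(path, collapse = "/"),
      node = node, value = value, parent = parent, type = type
    )
  }

  handlers <- list(
    .startElement = function(name, atts, ...) {
      path <<- c(path, name)
      addRow(name, NA_character_, parentOf(), "element")
      for (i in seq_along(atts)) {
        addRow(names(atts)[i], atts[[i]], parentOf(), "attribute")
      }
    },
    .endElement = function(name, ...) {
      path <<- path[-length(path)]
    },
    .text = function(value, ...) {
      value <- trimws(value)
      if (nzchar(value)) {
        addRow(path[length(path)], value, parentOf(), "text")
      }
    }
  )

  list(
    handlers = handlers,
    # bind all rows in one step instead of growing a data frame row by row
    result = function() {
      as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
    }
  )
}

# Small inline document so the sketch runs without test.xml
xml <- '<CATALOG><PLANT id="1"><COMMON Source="NLM">Bloodroot</COMMON></PLANT></CATALOG>'
collector <- makeCollector()
invisible(xmlEventParse(xml, handlers = collector$handlers, asText = TRUE,
                        useDotNames = TRUE, addContext = FALSE))
df <- collector$result()
print(df)
```

For a real file you would pass "test.xml" and drop asText = TRUE. Every row still sits in memory until the end, so this trades the disk write for RAM.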
Upvotes: 4