user2877232

Reputation: 89

Speed efficiency - R for loop

The following code parses XML and extracts information such as node, parent and type into a data frame. It works fine for a small XML file, but with a file of more than 25,000 lines it takes a couple of minutes to process, so I would like to optimize the code to run faster. The aim of the function is to read any XML file and generate the data required by the data frame.

Sample XML:

<?xml version="1.0" encoding="UTF-8"?>
<CATALOG>
   <PLANT id="1" required="false">
      <COMMON Source="NLM">Bloodroot</COMMON>
      <BOTANICAL>Aquilegia canadensis</BOTANICAL>
      <DATE>
         <Year>2013</Year>
      </DATE>
   </PLANT>
   <PLANT id="2" required="true">
      <COMMON Source="LNP">Columbine</COMMON>
      <BOTANICAL>Aquilegia canadensis</BOTANICAL>
      <DATE>
         <Year>2014</Year>
      </DATE>
   </PLANT>
</CATALOG>

Output:

                      path      node                value  parent      type
1                  CATALOG   CATALOG                 NULL    NULL   element
2            CATALOG/PLANT     PLANT                 NULL CATALOG   element
3            CATALOG/PLANT        id                    1   PLANT attribute
4            CATALOG/PLANT  required                false   PLANT attribute
5     CATALOG/PLANT/COMMON    COMMON            Bloodroot   PLANT      text
6     CATALOG/PLANT/COMMON    Source                  NLM  COMMON attribute
7  CATALOG/PLANT/BOTANICAL BOTANICAL Aquilegia canadensis   PLANT      text
8       CATALOG/PLANT/DATE      DATE                 NULL   PLANT   element
9  CATALOG/PLANT/DATE/Year      Year                 2013    DATE      text
10           CATALOG/PLANT     PLANT                 NULL CATALOG   element
11           CATALOG/PLANT        id                    2   PLANT attribute
12           CATALOG/PLANT  required                 true   PLANT attribute
13    CATALOG/PLANT/COMMON    COMMON            Columbine   PLANT      text
14    CATALOG/PLANT/COMMON    Source                  LNP  COMMON attribute
15 CATALOG/PLANT/BOTANICAL BOTANICAL Aquilegia canadensis   PLANT      text
16      CATALOG/PLANT/DATE      DATE                 NULL   PLANT   element
17 CATALOG/PLANT/DATE/Year      Year                 2014    DATE      text

Code Snippet:

library(XML)
library(plyr)

## helper function of xPathApply
getValues <- function(x) {
  List <- list()

  # find all ancestors of a given node
  ancestorNames <- character()  
  ancestorNamesList <- xmlAncestors(x, fun = function(y) {
    ancestorNames <- c(ancestorNames, xmlName(y))})  
  pathName <- paste(ancestorNamesList, collapse = "/")

  # find the parent of a given node
  parentNode <- xmlParent(x)
  parentName <- "NULL"
  if(!is.null(parentNode)) {
    parentName <- xmlName(parentNode)
  } 

  if(inherits(x, "XMLInternalElementNode")) {
    # check if the value of the given node exists i.e. text
    if(length(xmlValue(x, recursive=FALSE)) != 0) {
      List <- append(List, list(path = pathName, node = xmlName(x), value = xmlValue(x, recursive=FALSE), parent = parentName, type = "text"))
    } else {
      List <- append(List, list(path = pathName, node = xmlName(x), value = "NULL", parent = parentName, type = "element"))      
    }
  }

  ## attributes
  if(!is.null(xmlAttrs(x))) {
    num.attributes = xmlSize(xmlAttrs(x))
    for (i in seq_len(num.attributes)) {
      # get the attribute name
      attributeName <- names(xmlAttrs(x)[i])
      # get the attribute value
      attributeValue <- xmlAttrs(x)[[i]]  

      List <- append(List, list(path = pathName, node = attributeName, value = attributeValue, parent = parentName, type = "attribute"))      
    }
  }

  return(List)
}

## recursive function 
visitNode <- function(node, xpath) {
  if (is.null(node)) {
    return()
  }

  # number of children of a node
  num.children <- xmlSize(node)

  bypass <- function(n = num.children) {
    if(num.children == 0) {
      xpathSApply(node, path = xpath, getValues)
    } else {
      return(num.children)
    }
  }

  # recursive call to visitNode 
  for (i in seq_len(num.children)) { 
    visitNode(node[[i]], xpath) 
  }   

  # add list type result to data frame
  if(is.list(result <- bypass())) {    
    dt <<- do.call(rbind.fill, lapply(result, data.frame)) 
  }
} 


# read XML data from the given file
xtree <- xmlParse("test.xml")

# retrieve the root of the XML
root <- xmlRoot(xtree)

# define data frame which is to hold the data interpreted from XML
dt <- data.frame(path = NA, node = NA, value = NA, parent = NA, type = NA)

# call to recursive function
visitNode(root, xpath = "//node()")

dt

Upvotes: 2

Views: 521

Answers (1)

MrFlick

Reputation: 206197

I really wish there was good XSLT support in R, but I can't seem to find a great package for it. A different strategy would be to transform the XML into a simpler data file that you can easily read with read.table or something similar. You can parse it pretty easily with xmlEventParse. Here's a custom handler which seems to create the data you want:

getHandler <- function(file="", sep=",") {
    list(.startDocument = function(.state) {
           # write the header row (without append, so an existing file is overwritten)
           cat("path","node","value","parent","type", file=file, sep=sep)
           cat("\n", file=file, sep=sep, append=TRUE)
           .state
    }, .startElement=function(name, atts, .state) {
       # push the element name onto the path stack
       .state$path <- c(.state$path, name)
       # one "element" row for every opening tag
       cat(paste(.state$path, collapse="/"), name, NA, .state$path[length(.state$path)-1], "element", sep=sep, file=file, append=TRUE)
       cat("\n",  file=file, append=TRUE)
       # one "attribute" row per attribute, if any
       if(!is.null(atts)) {
           cat(paste(paste(.state$path, collapse="/"), names(atts), atts, .state$path[length(.state$path)-1], "attribute", sep=sep, collapse="\n"), file=file, append=TRUE)
           cat("\n",file=file, append=TRUE)
       }
       .state
    }, .endElement=function(name, .state) {
       # pop the element name off the path stack
       .state$path <- .state$path[-length(.state$path)]
       .state
    }, .text=function(value, .state) {
       # trim whitespace and skip text nodes that are empty afterwards
       value <- gsub("^\\s+|\\s+$", "", value)
       if(nchar(value)>0) {
           cat(paste(.state$path, collapse="/"), .state$path[length(.state$path)], value, .state$path[length(.state$path)-1], "text", sep=sep, file=file, append=TRUE)
           cat("\n", file=file, append=TRUE)
       }
       .state
    })
}
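
The path bookkeeping in the handler above is a plain stack: push the element name on a start event, pop it on an end event. The following is only a minimal base R sketch of that idea. The `events` list is hypothetical data standing in for the callbacks that xmlEventParse would make, and the names `seen` and `parents` are made up for illustration.

```r
# Hypothetical event stream, standing in for the parser's start/end callbacks
events <- list(
  list(kind = "start", name = "CATALOG"),
  list(kind = "start", name = "PLANT"),
  list(kind = "start", name = "COMMON"),
  list(kind = "end",   name = "COMMON"),
  list(kind = "end",   name = "PLANT"),
  list(kind = "end",   name = "CATALOG")
)

path    <- character(0)  # stack of currently open element names
seen    <- character(0)  # full path recorded at each start event
parents <- character(0)  # parent name recorded at each start event

for (ev in events) {
  if (ev$kind == "start") {
    path <- c(path, ev$name)                      # push
    seen <- c(seen, paste(path, collapse = "/"))
    # the parent is the entry just below the top of the stack, if there is one
    parents <- c(parents,
                 if (length(path) > 1) path[length(path) - 1] else NA_character_)
  } else {
    path <- path[-length(path)]                   # pop
  }
}

print(data.frame(path = seen, parent = parents))
```

Once every element has been closed, the stack is empty again, which is why the handler can carry the whole state in a single character vector.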

So it's not exactly pretty, but the handler is basically just building strings with cat(). We can then use it with

zz <- xmlEventParse("test.xml",
    handlers = getHandler(), 
    state = list(path=character(0)), useDotNames=TRUE)

This will output the data as comma-separated values to the screen. To save to a file, you can do

zz <- xmlEventParse("test.xml",
    handlers = getHandler(file="ok.txt", sep="\t"), 
    state = list(path=character(0)), useDotNames=TRUE)

which will write the data as tab-delimited text to a file named "ok.txt". You can then read the data in with

read.table("ok.txt", sep="\t", header=TRUE)

which returns

                      path      node                value  parent      type
1                  CATALOG   CATALOG                 <NA>           element
2            CATALOG/PLANT     PLANT                 <NA> CATALOG   element
3            CATALOG/PLANT        id                    1 CATALOG attribute
4            CATALOG/PLANT  required                false CATALOG attribute
5     CATALOG/PLANT/COMMON    COMMON                 <NA>   PLANT   element
6     CATALOG/PLANT/COMMON    Source                  NLM   PLANT attribute
7     CATALOG/PLANT/COMMON    COMMON            Bloodroot   PLANT      text
8  CATALOG/PLANT/BOTANICAL BOTANICAL                 <NA>   PLANT   element
9  CATALOG/PLANT/BOTANICAL BOTANICAL Aquilegia canadensis   PLANT      text
10      CATALOG/PLANT/DATE      DATE                 <NA>   PLANT   element
11 CATALOG/PLANT/DATE/Year      Year                 <NA>    DATE   element
12 CATALOG/PLANT/DATE/Year      Year                 2013    DATE      text
13           CATALOG/PLANT     PLANT                 <NA> CATALOG   element
14           CATALOG/PLANT        id                    2 CATALOG attribute
15           CATALOG/PLANT  required                 true CATALOG attribute
16    CATALOG/PLANT/COMMON    COMMON                 <NA>   PLANT   element
17    CATALOG/PLANT/COMMON    Source                  LNP   PLANT attribute
18    CATALOG/PLANT/COMMON    COMMON            Columbine   PLANT      text
19 CATALOG/PLANT/BOTANICAL BOTANICAL                 <NA>   PLANT   element
20 CATALOG/PLANT/BOTANICAL BOTANICAL Aquilegia canadensis   PLANT      text
21      CATALOG/PLANT/DATE      DATE                 <NA>   PLANT   element
22 CATALOG/PLANT/DATE/Year      Year                 <NA>    DATE   element
23 CATALOG/PLANT/DATE/Year      Year                 2014    DATE      text

Now there are more rows than you had in your sample, but some of the selection rules weren't that clear to me.
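
If the extra rows are unwanted, one possible post-processing step is to drop an "element" row whenever the same node also produced a "text" row. This is only a sketch under a guessed selection rule (an element with text should appear once, as "text"); the helper name `dropRedundantElements` and the small `sample_rows` data frame are made up for illustration, and mixed content or repeated text chunks would need more care.

```r
# Hypothetical helper: drop an "element" row when the next non-attribute row
# is a "text" row with the same path (i.e. the same node carries text).
dropRedundantElements <- function(df) {
  n <- nrow(df)
  keep <- rep(TRUE, n)
  for (i in seq_len(n)) {
    if (df$type[i] != "element") next
    j <- i + 1
    # skip over the attribute rows that belong to this element
    while (j <= n && df$type[j] == "attribute") j <- j + 1
    if (j <= n && df$type[j] == "text" && df$path[j] == df$path[i]) {
      keep[i] <- FALSE
    }
  }
  df[keep, , drop = FALSE]
}

# A few rows in the same shape as the output above
sample_rows <- data.frame(
  path   = c("CATALOG", "CATALOG/PLANT", "CATALOG/PLANT",
             "CATALOG/PLANT/COMMON", "CATALOG/PLANT/COMMON", "CATALOG/PLANT/COMMON"),
  node   = c("CATALOG", "PLANT", "id", "COMMON", "Source", "COMMON"),
  value  = c(NA, NA, "1", NA, "NLM", "Bloodroot"),
  parent = c("", "CATALOG", "CATALOG", "PLANT", "PLANT", "PLANT"),
  type   = c("element", "element", "attribute", "element", "attribute", "text"),
  stringsAsFactors = FALSE
)

filtered <- dropRedundantElements(sample_rows)
print(filtered)
```

Note that the row order still differs from the sample in the question: here the attribute rows come before the text row of the same node.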

The main idea is that xmlEventParse is more efficient than xmlParse because it doesn't have to load the entire tree into memory. Additionally, by using cat() to dump to a file, I don't have to worry about memory management right away (though writing to disk isn't exactly great either).
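
If the round trip through a file is not wanted, an alternative is to collect the rows in memory and build the data frame once at the end, rather than growing a data frame row by row. The following is a rough base R sketch of such a collector; `makeCollector`, `add` and `result` are hypothetical names, not part of the XML package, and the handler functions would have to call `add()` where they currently call cat().

```r
# Hypothetical in-memory collector: rows are stored in a list held by a
# closure and turned into a data frame in one step at the end.
makeCollector <- function() {
  rows <- list()
  n <- 0L
  add <- function(path, node, value, parent, type) {
    n <<- n + 1L
    rows[[n]] <<- list(path = as.character(path), node = as.character(node),
                       value = as.character(value), parent = as.character(parent),
                       type = as.character(type))
    invisible(NULL)
  }
  result <- function() {
    cols <- c("path", "node", "value", "parent", "type")
    # build one character vector per column, then assemble the data frame once
    out <- lapply(cols, function(k) vapply(rows, function(r) r[[k]], character(1)))
    names(out) <- cols
    as.data.frame(out, stringsAsFactors = FALSE)
  }
  list(add = add, result = result)
}

collector <- makeCollector()
collector$add("CATALOG", "CATALOG", NA, NA, "element")
collector$add("CATALOG/PLANT", "id", "1", "PLANT", "attribute")
df <- collector$result()
print(df)
```

Whether this beats writing to disk depends on the size of the file; for very large documents the file-based approach keeps memory use flatter.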

Anyway, it's at least another option to consider.

Upvotes: 4
