Speed efficiency - R for loop

Question

The following code is being used to parse XML in order to extract information like node, parent, type and so on into a data frame. It works fine for a small XML file of lines but when a file of greater than 25,000 lines is used it takes a couple of minutes to process. Hence I intend optimizing the code to process faster. The aim of the function is to read any XML file and generate data as required by the data frame.

Sample XML:



   
      Bloodroot
      Aquilegia canadensis
      
         2013
      
   
   
      Columbine
      Aquilegia canadensis
      
         2014

Output:

                      path      node                value  parent      type
1                  CATALOG   CATALOG                 NULL    NULL   element
2            CATALOG/PLANT     PLANT                 NULL CATALOG   element
3            CATALOG/PLANT        id                    1   PLANT attribute
4            CATALOG/PLANT  required                false   PLANT attribute
5     CATALOG/PLANT/COMMON    COMMON            Bloodroot   PLANT      text
6     CATALOG/PLANT/COMMON    Source                  NLM  COMMON attribute
7  CATALOG/PLANT/BOTANICAL BOTANICAL Aquilegia canadensis   PLANT      text
8       CATALOG/PLANT/DATE      DATE                 NULL   PLANT   element
9  CATALOG/PLANT/DATE/Year      Year                 2013    DATE      text
10           CATALOG/PLANT     PLANT                 NULL CATALOG   element
11           CATALOG/PLANT        id                    2   PLANT attribute
12           CATALOG/PLANT  required                 true   PLANT attribute
13    CATALOG/PLANT/COMMON    COMMON            Columbine   PLANT      text
14    CATALOG/PLANT/COMMON    Source                  LNP  COMMON attribute
15 CATALOG/PLANT/BOTANICAL BOTANICAL Aquilegia canadensis   PLANT      text
16      CATALOG/PLANT/DATE      DATE                 NULL   PLANT   element
17 CATALOG/PLANT/DATE/Year      Year                 2014    DATE      text

Code Snippet:

library(XML)
library(plyr)

## helper function of xPathApply
getValues <- function(x) {
  List <- list()

  # find all ancestors of a given node
  ancestorNames <- character()  
  ancestorNamesList <- xmlAncestors(x, fun = function(y) {
    ancestorNames <- c(ancestorNames, xmlName(y))})  
  pathName <- paste(ancestorNamesList, collapse = "/")

  # find the parent of a given node
  parentNode <- xmlParent(x)
  parentName <- "NULL"
  if(!is.null(parentNode)) {
    parentName <- xmlName(parentNode)
  } 

  if(inherits(x, "XMLInternalElementNode")) {
    # check if the value of the given node exists i.e. text
    if(length(xmlValue(x, recursive=FALSE)) != 0) {
      List <- append(List, list(path = pathName, node = xmlName(x), value = xmlValue(x, recursive=FALSE), parent = parentName, type = "text"))
    } else {
      List <- append(List, list(path = pathName, node = xmlName(x), value = "NULL", parent = parentName, type = "element"))      
    }
  }

  ## attributes
  if(!is.null(xmlAttrs(x))) {
    num.attributes = xmlSize(xmlAttrs(x))
    for (i in seq_len(num.attributes)) {
      # get the attribute name
      attributeName <- names(xmlAttrs(x)[i])
      # get the attribute value
      attributeValue <- xmlAttrs(x)[[i]]  

      List <- append(List, list(path = pathName, node = attributeName, value = attributeValue, parent = parentName, type = "attribute"))      
    }
  }

  return(List)
}

## recursive function 
visitNode <- function(node, xpath) {
  if (is.null(node)) {
    return()
  }

  # number of children of a node
  num.children <- xmlSize(node)

  bypass <- function(n = num.children) {
    if(num.children == 0) {
      xpathSApply(node, path = xpath, getValues)
    } else {
      return(num.children)
    }
  }

  # recursive call to visitNode 
  for (i in seq_len(num.children)) { 
    visitNode(node[[i]], xpath) 
  }   

  # add list type result to data frame
  if(is.list(result <- bypass())) {    
    dt <<- do.call(rbind.fill, lapply(result, data.frame)) 
  }
} 


# read XML data from the given file
xtree <- xmlParse("test.xml")

# retrieve the root of the XML
root <- xmlRoot(xtree)

# define data frame which is to hold the data interpreted from XML
dt <- data.frame(path = NA, node = NA, value = NA, parent = NA, type = NA)

# call to recursive function
visitNode(root, xpath <- "//node()")

dt

Speed efficiency - R for loop

Answers (1)

Related Questions