Gakku

Reputation: 367

Most efficient way to extract data from large XML files in R

I have a few large XML files (~10 GB each, and growing every week) which I need to convert from XML to a data.frame in R for analysis. The structure of the XML is as follows (with multiple records and a few more field elements per record):

<recordGroup>
  <records>
    <record>
      <recordId>123442</recordId>
      <reportingCountry>PT</reportingCountry>
      <date>2020-02-20</date>
      <field>
        <fieldName>Gender</fieldName>
        <fieldValue>F</fieldValue>
      </field>
      <field>
        <fieldName>Age</fieldName>
        <fieldValue>57</fieldValue>
      </field>
      <field>
        <fieldName>ClinicalSymptoms</fieldName>
        <fieldValue>COUGH</fieldValue>
        <fieldValue>FEVER</fieldValue>
        <fieldValue>O</fieldValue>
        <fieldValue>RUNOS</fieldValue>
        <fieldValue>SBREATH</fieldValue>
      </field>
    </record>
  </records>
</recordGroup>

I have been trying to find the most efficient way of extracting the data and converting it to a data.frame. One major challenge is that the files are quite large: both XML and xml2 run into memory problems, and processing takes hours. My current strategy is to use xmlEventParse with the code below, but this seems to be even more inefficient.

library(XML)

# Long-format accumulator: one row per (recordId, fieldName, fieldValue)
value_df <- data.frame(recordId = character(), fieldName = character(),
                       fieldValue = character(), stringsAsFactors = FALSE)
nvar <- 0

xmlEventParse(xmlDoc_clean,
              list(
                startElement = function(name, attrs) {
                  tagName <<- name
                },
                text = function(x) {
                  if (nchar(x) > 0) {
                    if (tagName == "recordId") {
                      rec <<- x
                    } else if (tagName == "fieldName") {
                      var_f <<- x
                    } else if (tagName == "fieldValue") {
                      nvar <<- nvar + 1
                      value_df[nvar, 1:3] <<- c(rec, var_f, x)
                    }
                  }
                },
                endElement = function(name) {
                  if (name == "record") {
                    print(nvar)
                  }
                }
              ))

I have tried xml2 (memory issues), XML (memory issues as well with standard DOM parsing), and was also going to try XMLSchema but didn't manage to get it to work. Both XML and xml2 work if the files are split up.

Would appreciate any guidance on improving efficiency, as the files I am working with are becoming larger every week. I am using R on a Linux machine.

Upvotes: 3

Views: 1786

Answers (2)

Gakku

Reputation: 367

In the end the fastest approach I found was the following:

  1. Split the XML files into smaller chunks using xml2 (a sketch of this splitting step follows the list). I have >100 GB of RAM on the server I am working on, so I could parallelize this process using foreach with 6 workers, but mileage varies depending on how much RAM is available.
  2. The function splitting the files returns a data.frame with the locations of the split files.
  3. Process the smaller XML files in a foreach loop - this time it is possible to use all cores, so I have gone with 12 workers. The processing uses xml2, as I found that to be the fastest way. The extracted data is initially in long format, but I convert it to wide format within the loop.
  4. The loop binds the individual data.frames into one large data.frame. The final step is using fwrite to save the CSV file. This seems to be the most efficient way (a sketch of this processing step appears further below).
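
The splitting function itself isn't shown here; as a rough sketch of what steps 1-2 could look like with xml2 (split_xml(), out_dir and chunk_size are illustrative names rather than my actual code, and reading the whole file at once assumes enough RAM, as noted above):

library(xml2)

# Hypothetical splitter: reads the large file once, writes chunks of
# `chunk_size` <record> nodes to new files, and returns a data.frame
# with their locations (steps 1-2).
split_xml <- function(big_file, out_dir, chunk_size = 10000) {
  doc     <- read_xml(big_file)
  records <- xml_find_all(doc, ".//record")
  starts  <- seq(1, length(records), by = chunk_size)

  paths <- vapply(seq_along(starts), function(i) {
    idx   <- starts[i]:min(starts[i] + chunk_size - 1, length(records))
    chunk <- xml_new_root("recordGroup")
    recs  <- xml_add_child(chunk, "records")
    for (r in records[idx]) xml_add_child(recs, r)   # copy each record into the chunk
    out <- file.path(out_dir, sprintf("chunk_%04d.xml", i))
    write_xml(chunk, out)
    out
  }, character(1))

  data.frame(path = paths, stringsAsFactors = FALSE)
}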

With this approach I can process a 2.6GB XML file in 6.5 minutes.

I will add my code eventually, but it is quite specific, so I need to generalize it a bit first.
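
In the meantime, here is a generic sketch of steps 3-4 (parse_chunk() and chunk_files are placeholder names; for simplicity the long-to-wide conversion here happens once after binding rather than inside the loop):

library(xml2)
library(foreach)
library(doParallel)
library(data.table)

# Parse one small XML chunk into a long data.table with xml2
parse_chunk <- function(path) {
  doc     <- read_xml(path)
  records <- xml_find_all(doc, ".//record")
  rbindlist(lapply(records, function(rec) {
    id     <- xml_text(xml_find_first(rec, "./recordId"))
    fields <- xml_find_all(rec, "./field")
    rbindlist(lapply(fields, function(f) {
      data.table(recordId   = id,
                 fieldName  = xml_text(xml_find_first(f, "./fieldName")),
                 fieldValue = xml_text(xml_find_all(f, "./fieldValue")))
    }))
  }))
}

cl <- makeCluster(12)   # 12 workers, as in step 3
registerDoParallel(cl)

# chunk_files: the data.frame of split-file paths returned by step 2
long_dt <- foreach(f = chunk_files$path, .combine = rbind,
                   .packages = c("xml2", "data.table")) %dopar% {
  parse_chunk(f)
}
stopCluster(cl)

# Long -> wide: one row per record, repeated fieldValues collapsed with ";"
wide_dt <- dcast(long_dt, recordId ~ fieldName, value.var = "fieldValue",
                 fun.aggregate = function(x) paste(x, collapse = ";"))

# Step 4: save the combined result with fwrite
fwrite(wide_dt, "records_wide.csv")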

Upvotes: 0

Parfait

Reputation: 107587

When memory is a challenge, consider the hard disk. Specifically, consider building a large CSV version of the extracted, parsed XML data with iterative append calls via write.table in an xmlEventParse run:

library(XML)

# INITIALIZE CSV WITH HEADER ROW
csv <- file.path("C:", "Path", "To", "Large.csv")
fileConn <- file(csv); writeLines("id,tag,text", fileConn); close(fileConn)

i <- 0
doc <- file.path("C:", "Path", "To", "Large.xml")
output <- xmlEventParse(doc,
                        list(startElement = function(name, attrs) {
                          if (name == "recordId") {i <<- i + 1}
                          tagName <<- name
                        }, text = function(x) {
                          if (nchar(trimws(x)) > 0) {
                            # Append one long-format row (id, tag, text) per text node
                            write.table(data.frame(id = i, tag = tagName, text = x),
                                        file = csv, append = TRUE, sep = ",",
                                        row.names = FALSE, col.names = FALSE)
                          }
                        }),
                        useTagName = FALSE, addContext = FALSE)

Output

Obviously, further data wrangling will be needed for proper row/column migration. But you can now read the large CSV with any of the many tools out there, or process it in chunks. A reshaping sketch follows the sample output below.

id,tag,text
1,"recordId","123442"
1,"reportingCountry","PT"
1,"date","2020-02-20"
1,"fieldName","Gender"
1,"fieldValue","F"
1,"fieldName","Age"
1,"fieldValue","57"
1,"fieldName","ClinicalSymptoms"
1,"fieldValue","COUGH"
1,"fieldValue","FEVER"
1,"fieldValue","O"
1,"fieldValue","RUNOS"
1,"fieldValue","SBREATH"

Upvotes: 1
