Reputation: 367
I have a few large XML files (~10 GB and growing every week) which I need to convert to a data.frame in R for analysis. The structure of the XML is as follows (with multiple records and a few more field elements per record):
<recordGroup>
  <records>
    <record>
      <recordId>123442</recordId>
      <reportingCountry>PT</reportingCountry>
      <date>2020-02-20</date>
      <field>
        <fieldName>Gender</fieldName>
        <fieldValue>F</fieldValue>
      </field>
      <field>
        <fieldName>Age</fieldName>
        <fieldValue>57</fieldValue>
      </field>
      <field>
        <fieldName>ClinicalSymptoms</fieldName>
        <fieldValue>COUGH</fieldValue>
        <fieldValue>FEVER</fieldValue>
        <fieldValue>O</fieldValue>
        <fieldValue>RUNOS</fieldValue>
        <fieldValue>SBREATH</fieldValue>
      </field>
    </record>
  </records>
</recordGroup>
I have been trying to find the most efficient way of extracting the data and converting it to a data.frame. One major challenge is that the files are quite large: both XML and xml2 run into memory problems, apart from the fact that processing takes hours. My current strategy is to use xmlEventParse with the code below, but this seems to be even more inefficient.
library(XML)

# accumulate one row per fieldValue: recordId, fieldName, fieldValue
value_df <- data.frame(recordId = character(), vardf = character(), value = character())
nvar <- 0
tagName <- rec <- var_f <- ""

xmlEventParse(xmlDoc_clean,
  list(
    # remember which tag we are currently inside
    startElement = function(name, attrs) {
      tagName <<- name
    },
    # route text nodes to the right variable based on the current tag
    text = function(x) {
      if (nchar(x) > 0) {
        if (tagName == "recordId") {
          rec <<- x
        } else if (tagName == "fieldName") {
          var_f <<- x
        } else if (tagName == "fieldValue") {
          nvar <<- nvar + 1
          # growing the data.frame row by row -- this is the slow part
          value_df[nvar, 1:3] <<- c(rec, var_f, x)
        }
      }
    },
    endElement = function(name) {
      if (name == "record") {
        print(nvar)
      }
    }
  )
)
I have tried xml2 (memory issues), XML (memory issues as well with standard DOM parsing), and was also going to try XMLSchema but didn't manage to get it to work. Both XML and xml2 work if the files are split up.
I would appreciate any guidance on improving efficiency, as the files I am working with are becoming larger every week. I am using R on a Linux machine.
Upvotes: 3
Views: 1786
Reputation: 367
In the end the fastest approach I found was the following:

- A foreach loop with 6 workers, but mileage varies depending on how much RAM is available.
- Another foreach loop - this time it is possible to use all cores, so I have gone with 12 workers. The processing uses xml2 as I found that to be the fastest way. Initially the extracted data is in a long format, but I then convert it to a wide format within the loop.
- fwrite to save the CSV file. This seems to be the most efficient way.

With this approach I can process a 2.6 GB XML file in 6.5 minutes.
I will add code eventually, but it is quite specific so I need to generalize it a bit.
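In the meantime, here is a rough sketch of what the parsing loop could look like. This is illustrative only, not the actual code: the chunk file list, the parse_chunk() helper, the worker count and the output file name are placeholders.

library(foreach)
library(doParallel)
library(xml2)
library(data.table)

# placeholder: pre-split XML chunk files sitting in a directory
chunk_files <- list.files("xml_chunks", pattern = "\\.xml$", full.names = TRUE)

# parse one chunk into a long-format data.table: recordId / fieldName / fieldValue
parse_chunk <- function(path) {
  doc     <- read_xml(path)
  records <- xml_find_all(doc, ".//record")
  rbindlist(lapply(records, function(rec) {
    id     <- xml_text(xml_find_first(rec, "./recordId"))
    fields <- xml_find_all(rec, "./field")
    rbindlist(lapply(fields, function(f) {
      data.table(
        recordId   = id,
        fieldName  = xml_text(xml_find_first(f, "./fieldName")),
        fieldValue = xml_text(xml_find_all(f, "./fieldValue"))
      )
    }))
  }))
}

cl <- makeCluster(12)   # 12 workers as mentioned above; adjust to your RAM/cores
registerDoParallel(cl)

long <- foreach(f = chunk_files, .combine = rbind,
                .packages = c("xml2", "data.table")) %dopar% parse_chunk(f)

stopCluster(cl)

# long -> wide: one column per fieldName, repeated fieldValues collapsed
wide <- dcast(long, recordId ~ fieldName, value.var = "fieldValue",
              fun.aggregate = function(x) paste(x, collapse = ";"))

fwrite(wide, "records_wide.csv")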
Upvotes: 0
Reputation: 107587
When memory is a challenge, consider the hard disk. Specifically, consider building a large CSV version of the parsed XML data with iterative append calls via write.table inside an xmlEventParse run:
library(XML)

# INITIALIZE CSV WITH HEADER ROW
csv <- file.path("C:", "Path", "To", "Large.csv")
fileConn <- file(csv); writeLines("id,tag,text", fileConn); close(fileConn)

i <- 0
doc <- file.path("C:", "Path", "To", "Large.xml")

output <- xmlEventParse(doc,
  list(
    startElement = function(name, attrs) {
      # a new <recordId> marks the start of the next record
      if (name == "recordId") { i <<- i + 1 }
      tagName <<- name
    },
    text = function(x) {
      if (nchar(trimws(x)) > 0) {
        # append one id/tag/text row per non-empty text node to the CSV on disk
        write.table(data.frame(id = i, tag = tagName, text = x),
                    file = csv, append = TRUE, sep = ",",
                    row.names = FALSE, col.names = FALSE)
      }
    }
  ),
  useTagName = FALSE, addContext = FALSE)
Output
Obviously, further data wrangling will be needed for proper row/column reshaping, but you can now read the large CSV with any of the many tools out there, or in chunks (one possible sketch follows the output below).
id,tag,text
1,"recordId","123442"
1,"reportingCountry","PT"
1,"date","2020-02-20"
1,"fieldName","Gender"
1,"fieldValue","F"
1,"fieldName","Age"
1,"fieldValue","57"
1,"fieldName","ClinicalSymptoms"
1,"fieldValue","COUGH"
1,"fieldValue","FEVER"
1,"fieldValue","O"
1,"fieldValue","RUNOS"
1,"fieldValue","SBREATH"
Upvotes: 1