Converting a nested XML document to a data frame, when information is both in attribute and text nodes

Question

Background:

I have an XML document with the following structure:


 
  
  Certificates
  80
  
  Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
  
 
 
  
  Certificates
  70
  
  Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

What I'm trying to do:

Using R, I'm trying to convert this XML information into a dataframe, where each row represents a single record, and each column represents either attribute or text data for that record (with the goal of including all the data that exists in the XML document).

This is what the final output should look like:

Record ID | Size | User ID | ... | Text           |
       452|     2|     7623| ... | Lorem ipsum... |

Also, since there are around 1,000,000 records, and the file containing them is ~500MB, I'm trying to find a relatively efficient way to do this.

What I've tried so far:

I've looked at a number of related questions on the topic, but none of them offered a solution that applied in this case.

First, I tried using the 'xmlToDataFrame' function in the XML package using the following code, but it's only extracting the text data, and not the attributes:

library(XML)
doc = xmlParse("My_document.xml")
xmldf = xmlToDataFrame(doc, nodes = "//record")
xmldf = xmlToDataFrame(nodes = getNodeSet(doc, "//record"))

The same happens when I try to use the flatxml package, despite the fact that during the initial import of the XML document it does extract the relevant attribute data:

library(flatxml)
doc = fxml_importXMLFlat("My_document.xml")
xmldf = fxml_toDataFrame(xml_original, siblings.of = 2)

I also tried a slightly different approach using the xml2 package:

library(xml2)
doc <- read_xml('My_document.xml')
rows <- xml_children(doc)
data.frame(
  Record_ID = as.numeric(xml_attr(rows,"id")),
  Size = as.numeric(xml_attr(rows,"size")),
  User_ID = as.numeric(xml_attr(rows,"id")),
  Origin = as.character(xml_attr(rows,"origin")),
  Category = as.character(xml_text(rows,"category")),
  Category_ID = as.numeric(xml_attr(rows,"id")),
  Rating = as.numeric(xml_text(rows,"rating")),
  Text = as.character(xml_text(rows,"text"))
) -> xmldf

Here I had a different set of issues: I'm able to extract attribute data, but only from the 'record' node. This means that it copies the 'id' data from the record for the 'User_ID', and is unable to access the relevant data for things such as the 'origin' attribute. In addition, this process also pulls all the text information from all nodes simultaneously each time I attempt to extract it.

Parfait · Accepted Answer

Consider binding attributes with the internal method, xmlAttrsToDataFrame, and elements with xmlToDataFrame, assuming only one set of user and sibling tags per record.

library(XML)
...

# BIND ATTRIBUTES AND ELEMENTS
record_df <- cbind(XML:::xmlAttrsToDataFrame(getNodeSet(doc, path='//record')),
                   XML:::xmlAttrsToDataFrame(getNodeSet(doc, path='//user')),
                   xmlToDataFrame(doc, nodes = getNodeSet(doc, "//record"))
             )    

# RENAME COLUMNS
record_df <- setNames(record_df, c("record_id", "record_size", "user_id", "user_origin",
                                   "record_user", "record_category", "record_rating", "record_text"))

record_df
#   record_id record_size user_id user_origin record_user record_category record_rating              record_text
# 1       512           1    8412          ab                Certificates            80  
Lorem ipsum dolor ...
# 2       452           2    7623          bb                Certificates            70   
Ut enim ad minim ...

Converting a nested XML document to a data frame, when information is both in attribute and text nodes

Answers (1)

Related Questions