Reputation: 257
Background:
I have an XML document with the following structure:
<records>
<record id="512" size="1">
<user id="8412" origin="ab"/>
<category id="105">Certificates</category>
<rating>80</rating>
<text>
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
</text>
</record>
<record id="452" size="2">
<user id="7623" origin="bb"/>
<category id="105">Certificates</category>
<rating>70</rating>
<text>
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
</text>
</record>
</records>
What I'm trying to do:
Using R, I'm trying to convert this XML information into a dataframe, where each row represents a single record, and each column represents either attribute or text data for that record (with the goal of including all the data that exists in the XML document).
This is what the final output should look like:
Record ID | Size | User ID | ... | Text |
452| 2| 7623| ... | Lorem ipsum... |
Also, since there are around 1,000,000 records, and the file containing them is ~500MB, I'm trying to find a relatively efficient way to do this.
What I've tried so far:
I've looked at a number of related questions on the topic, but none of them offered a solution that applied in this case.
First, I tried using the 'xmlToDataFrame' function in the XML package using the following code, but it's only extracting the text data, and not the attributes:
library(XML)
doc = xmlParse("My_document.xml")
xmldf = xmlToDataFrame(doc, nodes = "//record")
xmldf = xmlToDataFrame(nodes = getNodeSet(doc, "//record"))
The same happens when I try to use the flatxml package, despite the fact that during the initial import of the XML document it does extract the relevant attribute data:
library(flatxml)
doc = fxml_importXMLFlat("My_document.xml")
xmldf = fxml_toDataFrame(xml_original, siblings.of = 2)
I also tried a slightly different approach using the xml2 package:
library(xml2)
doc <- read_xml('My_document.xml')
rows <- xml_children(doc)
data.frame(
Record_ID = as.numeric(xml_attr(rows,"id")),
Size = as.numeric(xml_attr(rows,"size")),
User_ID = as.numeric(xml_attr(rows,"id")),
Origin = as.character(xml_attr(rows,"origin")),
Category = as.character(xml_text(rows,"category")),
Category_ID = as.numeric(xml_attr(rows,"id")),
Rating = as.numeric(xml_text(rows,"rating")),
Text = as.character(xml_text(rows,"text"))
) -> xmldf
Here I had a different set of issues: I'm able to extract attribute data, but only from the 'record' node. This means that it copies the 'id' data from the record for the 'User_ID', and is unable to access the relevant data for things such as the 'origin' attribute. In addition, this process also pulls all the text information from all nodes simultaneously each time I attempt to extract it.
Upvotes: 1
Views: 665
Reputation: 107567
Consider binding attributes with the internal method, xmlAttrsToDataFrame
, and elements with xmlToDataFrame
, assuming only one set of user and sibling tags per record.
library(XML)
...
# BIND ATTRIBUTES AND ELEMENTS
record_df <- cbind(XML:::xmlAttrsToDataFrame(getNodeSet(doc, path='//record')),
XML:::xmlAttrsToDataFrame(getNodeSet(doc, path='//user')),
xmlToDataFrame(doc, nodes = getNodeSet(doc, "//record"))
)
# RENAME COLUMNS
record_df <- setNames(record_df, c("record_id", "record_size", "user_id", "user_origin",
"record_user", "record_category", "record_rating", "record_text"))
record_df
# record_id record_size user_id user_origin record_user record_category record_rating record_text
# 1 512 1 8412 ab Certificates 80 \nLorem ipsum dolor ...
# 2 452 2 7623 bb Certificates 70 \nUt enim ad minim ...
Upvotes: 1