Katherine York

Reputation: 13

Import XML to R data frame

I am trying to import an XML file into R. It has the format shown below: one event per row, followed by a number of attributes, and which attributes appear depends on the event type. The file is 0.7 GB, and future versions may be much bigger. I would like to create a data frame with one row per event and one column for every possible attribute (so some cells will be empty, depending on the event type). The answers I have found elsewhere all seem to deal with XML files that have a nested tree structure, and I can't work out how to apply them to this format.

I am new to R and have no experience with XML files so please give me the "for dummies" answer with plenty of explanation. Thanks!

<?xml version="1.0" encoding="utf-8"?>
<events version="1.0">
    <event time="21510.0" type="actend" person="3" link="1" actType="h"  />
    <event time="21510.0" type="departure" person="3" link="1" legMode="car"  />
    <event time="21510.0" type="PersonEntersVehicle" person="3" vehicle="3"  />
    <event time="21510.0" type="vehicle enters traffic" person="3" link="1" vehicle="3" networkMode="car" relativePosition="1.0"  />

...

</events>

Upvotes: 1

Views: 1871

Answers (1)

csgroen

Reputation: 2541

You can try something like this:

library(xml2)

original_xml <- '<?xml version="1.0" encoding="utf-8"?>
<events version="1.0">
    <event time="21510.0" type="actend" person="3" link="1" actType="h"  />
    <event time="21510.0" type="departure" person="3" link="1" legMode="car"  />
    <event time="21510.0" type="PersonEntersVehicle" person="3" vehicle="3"  />
    <event time="21510.0" type="vehicle enters traffic" person="3" link="1" vehicle="3" networkMode="car" relativePosition="1.0"  />
</events>'

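# One node per <event/> element (for a file on disk, pass its path to read_xml())
data2 <- xml_children(read_xml(original_xml))
# Every attribute name that occurs on at least one event
attr_names <- unique(names(unlist(xml_attrs(data2))))

# One column per attribute; xml_attr() returns NA where an event lacks that attribute.
# lapply() (rather than sapply()) keeps one column per attribute even with a single event.
xmlDataFrame <- as.data.frame(lapply(setNames(attr_names, attr_names), function(attr) {
    xml_attr(data2, attr = attr)
}), stringsAsFactors = FALSE)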

#-- all columns are character strings, so convert the numeric ones to numeric

num_cols <- c("time", "person", "link", "vehicle")
xmlDataFrame[num_cols] <- lapply(xmlDataFrame[num_cols], as.numeric)

If you have additional numeric columns (relativePosition, for example), add their names to the vector of column names so that they are converted too.
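
Since the question mentions a 0.7 GB file, reading the whole document in one go may need a lot of memory. Below is a rough base-R sketch of an alternative that reads the file in chunks of lines instead of parsing it as XML. It is not an xml2 feature: the function names (parse_event_lines, records_to_df, read_events_chunked) are made up for this example, and it assumes that each event sits on its own line, that attribute values are double-quoted, and that values contain no escaped characters such as &amp;amp; (they are not decoded here).

```r
# Turn lines containing "<event " into a list of named character vectors
# (attribute name -> value). Assumes one event per line.
parse_event_lines <- function(lines) {
  lines <- grep("<event ", lines, fixed = TRUE, value = TRUE)
  pairs <- regmatches(lines, gregexpr('[A-Za-z_][A-Za-z0-9_.:-]*="[^"]*"', lines))
  lapply(pairs, function(p) {
    keys <- sub('=".*$', "", p)
    vals <- sub('^[^=]*="(.*)"$', "\\1", p)
    setNames(vals, keys)
  })
}

# Build a character data frame with one column per name in cols;
# attributes missing from an event become NA.
records_to_df <- function(records, cols) {
  out <- lapply(setNames(cols, cols), function(k) {
    vapply(records, function(r) unname(r[k]), character(1))
  })
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Read the file chunk_size lines at a time and stack the per-chunk data frames.
read_events_chunked <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r", encoding = "UTF-8")
  on.exit(close(con))
  chunks <- list()
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break
    records <- parse_event_lines(lines)
    if (length(records) == 0L) next
    cols <- unique(unlist(lapply(records, names)))
    chunks[[length(chunks) + 1L]] <- records_to_df(records, cols)
  }
  if (length(chunks) == 0L) return(data.frame())
  # Chunks may have seen different attributes, so align the columns first.
  all_cols <- unique(unlist(lapply(chunks, names)))
  chunks <- lapply(chunks, function(df) {
    for (k in setdiff(all_cols, names(df))) df[[k]] <- NA_character_
    df[all_cols]
  })
  do.call(rbind, chunks)
}

# Small demonstration using the sample from the question, written to a temp file.
tmp <- tempfile(fileext = ".xml")
writeLines(c(
  '<?xml version="1.0" encoding="utf-8"?>',
  '<events version="1.0">',
  '    <event time="21510.0" type="actend" person="3" link="1" actType="h"  />',
  '    <event time="21510.0" type="departure" person="3" link="1" legMode="car"  />',
  '    <event time="21510.0" type="PersonEntersVehicle" person="3" vehicle="3"  />',
  '    <event time="21510.0" type="vehicle enters traffic" person="3" link="1" vehicle="3" networkMode="car" relativePosition="1.0"  />',
  '</events>'
), tmp)
events <- read_events_chunked(tmp, chunk_size = 3L)
str(events)
```

The result is still a single data frame held in memory, and all of its columns are character strings, so the numeric conversion shown above applies here as well. If the final table itself is too large, each chunk could be written out (to CSV, for instance) instead of being collected in a list.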

Upvotes: 2
