Reputation: 129
I would like to convert an XML file into a dataframe. I have found some functions which allow me to read the XML data, however I am not able to get a dataframe with the same structure as the initial XML file (= structure that you would get if you open the XML file in Excel).
This is my original XML code:
<Data>
<Frame timestamp='17/09/2014 20:55:00.902' timecode='75299902' >
<Object type='Taxi' DISTANCE='3037' VOLUME='1668' id='15593' code='0' />
<Object type='Taxi' DISTANCE='3605' VOLUME='931' id='15603' code='4' />
<Object type='Bus' DISTANCE='3563' VOLUME='488' id='15604' code='9' />
<Object type='Taxi' DISTANCE='4942' VOLUME='57' id='15624' code='1' />
<Object type='Taxi' DISTANCE='784' VOLUME='47' id='15625' code='10' />
<Object type='Taxi' DISTANCE='3301' VOLUME='2041' id='15626' code='42' />
<Object type='Bus' DISTANCE='2040' VOLUME='2945' id='15630' code='27' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
<TrackingFrame timestamp='17/09/2014 20:54:59.771' timecode='75299771' >
<Object type='Taxi' DISTANCE='4941' VOLUME='51' id='15624' code='1' />
<Object type='Taxi' DISTANCE='789' VOLUME='47' id='15625' code='10' />
<Object type='Taxi' DISTANCE='3300' VOLUME='2069' id='15626' code='42' />
<Object type='Bus' DISTANCE='2027' VOLUME='2947' id='15630' code='27' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
</Data>
This allows me already to get a list of the data: library(XML)
# Convert xml data to R
data <- xmlTreeParse(file="c:/R/CL/filename.xml",useInternalNode=TRUE)
# Create a list of the data
xl<-xmlToList(data)
Ideally I would like to get a dataframe based on this XML data that looks the same as when you would input the XML data in Excel. However, when I look at the output of xl then I see that this is organized in Objects and Times. Normally when I open the XML files in Excel this information is linked (and every Object has also columns with Time information)
This is the output of xl<-xmlToList(data):
$Frame$Object
type DISTANCE VOLUME id code
"Taxi" "3037" "1668" "15593" "0"
$Frame$Object
type DISTANCE VOLUME id code
"Taxi" "3605" "931" "15603" "4"
$Frame$Object
type DISTANCE VOLUME id code
“Bus” "3563" "488" "15604" "9"
$Frame$Object
type DISTANCE VOLUME id code
"Taxi" "2161" "1592" "15615" "21"
$Frame$Object
type DISTANCE VOLUME id code
"Taxi" "4942" "57" "15624" "1"
$Frame$Object
type DISTANCE VOLUME id code
"Taxi" "784" "47" "15625" "10"
$Frame$Object
type DISTANCE VOLUME id code
"Taxi" "3301" "2041" "15626" "42"
$Frame$Object
type DISTANCE VOLUME id code
“Bus” "2040" "2945" "15630" "27"
$Frame$Object
type DISTANCE VOLUME Z
"Airplane" "2865" "2722" "0"
$Frame$Time
timestamp timecode
"17/09/2014 20:54:59.902" "75299902"
$Frame$Object
type DISTANCE VOLUME id code
"Taxi" "4941" "51" "15624" "1"
$Frame$Object
type DISTANCE VOLUME id code
"Taxi" "789" "47" "15625" "10"
$Frame$Object
type DISTANCE VOLUME id code
"Taxi" "3300" "2069" "15626" "42"
$Frame$Object
type DISTANCE VOLUME id code
“Bus” "2027" "2947" "15630" "27"
$Frame$Object
type DISTANCE VOLUME Z
"Airplane" "2865" "2722" "0"
$Frame$Time
timestamp timecode
"17/09/2014 20:54:59.771" "75299771"
This lists contains 2 table structures / frames: Frame$Object and Frame$Time. I would like to combine these 2 structures into one combined table (by repeating the columns timestamp and timecode with the time information for every Object).
See here below the desired output (with the same structure as when you would enter the XML file in Excel):
type DISTANCE VOLUME id code z timestamp timecode
Taxi 3037 1668 15593 0 17/09/2014 20:54:59.902 75299902
Taxi 3605 931 15603 4 17/09/2014 20:54:59.902 75299902
Bus 3563 488 15604 9 17/09/2014 20:54:59.900 75299902
Taxi 4942 57 15624 1 17/09/2014 20:54:59.900 75299902
Taxi 784 47 15625 10 17/09/2014 20:54:59.900 75299902
Taxi 3301 2041 15626 42 17/09/2014 20:54:59.900 75299902
Bus 2040 2945 15630 27 17/09/2014 20:54:59.900 75299902
Airplane 2865 2722 0 17/09/2014 20:54:59.900 75299902
Taxi 4941 51 15624 1 17/09/2014 20:54:59.771 75299771
Taxi 789 47 15625 10 17/09/2014 20:54:59.771 75299771
Taxi 3300 2069 15626 42 17/09/2014 20:54:59.771 75299771
Bus 2027 2947 15630 27 17/09/2014 20:54:59.771 75299771
Airplane 2865 2722 0 17/09/2014 20:54:59.771 75299771
Which functions would work to achieve this result? Thank you on beforehand for your help!
Upvotes: 1
Views: 7679
Reputation: 78792
You can use xml2
and dplyr
for a quick conversion:
library(xml2)
library(dplyr)
dat <- "<Data>
<Frame timestamp='17/09/2014 20:55:00.902' timecode='75299902' >
<Object type='Taxi' DISTANCE='3037' VOLUME='1668' id='15593' code='0' />
<Object type='Taxi' DISTANCE='3605' VOLUME='931' id='15603' code='4' />
<Object type='Bus' DISTANCE='3563' VOLUME='488' id='15604' code='9' />
<Object type='Taxi' DISTANCE='4942' VOLUME='57' id='15624' code='1' />
<Object type='Taxi' DISTANCE='784' VOLUME='47' id='15625' code='10' />
<Object type='Taxi' DISTANCE='3301' VOLUME='2041' id='15626' code='42' />
<Object type='Bus' DISTANCE='2040' VOLUME='2945' id='15630' code='27' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
<Frame timestamp='17/09/2014 20:54:59.771' timecode='75299771' >
<Object type='Taxi' DISTANCE='4941' VOLUME='51' id='15624' code='1' />
<Object type='Taxi' DISTANCE='789' VOLUME='47' id='15625' code='10' />
<Object type='Taxi' DISTANCE='3300' VOLUME='2069' id='15626' code='42' />
<Object type='Bus' DISTANCE='2027' VOLUME='2947' id='15630' code='27' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
</Data>"
doc <- read_xml(dat)
# bind the data.frames built in the iterator together
bind_rows(lapply(xml_find_all(doc, "//Frame"), function(x) {
# extract the attributes from the parent tag as a data.frame
parent <- data.frame(as.list(xml_attrs(x)), stringsAsFactors=FALSE)
# make a data.frame out of the attributes of the kids
kids <- bind_rows(lapply(xml_children(x), function(x) as.list(xml_attrs(x))))
# combine them
cbind.data.frame(parent, kids, stringsAsFactors=FALSE)
}))
## Source: local data frame [13 x 8]
##
## timestamp timecode type DISTANCE VOLUME id code Z
## (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
## 1 17/09/2014 20:55:00.902 75299902 Taxi 3037 1668 15593 0 NA
## 2 17/09/2014 20:55:00.902 75299902 Taxi 3605 931 15603 4 NA
## 3 17/09/2014 20:55:00.902 75299902 Bus 3563 488 15604 9 NA
## 4 17/09/2014 20:55:00.902 75299902 Taxi 4942 57 15624 1 NA
## 5 17/09/2014 20:55:00.902 75299902 Taxi 784 47 15625 10 NA
## 6 17/09/2014 20:55:00.902 75299902 Taxi 3301 2041 15626 42 NA
## 7 17/09/2014 20:55:00.902 75299902 Bus 2040 2945 15630 27 NA
## 8 17/09/2014 20:55:00.902 75299902 Airplane 2865 2722 NA NA 0
## 9 17/09/2014 20:54:59.771 75299771 Taxi 4941 51 15624 1 NA
## 10 17/09/2014 20:54:59.771 75299771 Taxi 789 47 15625 10 NA
## 11 17/09/2014 20:54:59.771 75299771 Taxi 3300 2069 15626 42 NA
## 12 17/09/2014 20:54:59.771 75299771 Bus 2027 2947 15630 27 NA
## 13 17/09/2014 20:54:59.771 75299771 Airplane 2865 2722 NA NA 0
You'll need to convert the types as necessary.
You can do something similar if you're stuck with the XML
package:
doc <- xmlParse(dat)
bind_rows(xpathApply(doc, "//Frame", function(x) {
parent <- data.frame(as.list(xmlAttrs(x)), stringsAsFactors=FALSE)
kids <- bind_rows(lapply(xmlChildren(x), function(x) as.list(xmlAttrs(x))))
cbind.data.frame(parent, kids, stringsAsFactors=FALSE)
}))
Upvotes: 7
Reputation: 107567
Consider the XML
library's xpathsapply()
route with workarounds to retrieve timestamp
and timecode
for each child and handle the missing attributes for id
and code
:
library(XML)
doc <- xmlParse("C:/Path/To/XML/File.xml")
# RETRIEVE FRAME ATTRS DATA FOR EACH OBJECT CHILD
timestamp <- c()
timecode <- c()
numberofobjs <- length(xpathSApply(doc, "//Object"))
for (i in (1:numberofobjs)) {
timestamp <- c(timestamp, xpathSApply(doc, sprintf("//Object[%s]/ancestor::Frame", i),
xmlGetAttr, "timestamp"))
timecode <- c(timecode, xpathSApply(doc, sprintf("//Object[%s]/ancestor::Frame", i),
xmlGetAttr, "timecode"))
}
# XPATH TO EACH ATTRIBUTE
type <- xpathSApply(doc, "//Object", xmlGetAttr, "type")
distance <- xpathSApply(doc, "//Object", xmlGetAttr,"DISTANCE")
volume <- xpathSApply(doc, "//Object", xmlGetAttr, "VOLUME")
id <- xpathSApply(doc, "//Object", xmlGetAttr, "id")
id <- sapply(id, function(x) ifelse(is.null(x), NA, x)) # REMOVE NULLS
code <- xpathSApply(doc, "//Object", xmlGetAttr, "code")
code <- sapply(id, function(x) ifelse(is.null(x), NA, x)) # REMOVE NULLS
# COMBINE LISTS INTO DATA FRAME
xmldf <- data.frame(timecode = unlist(timecode),
timestamp = unlist(timestamp),
type = unlist(type),
distance = unlist(distance),
volume = unlist(volume),
id = unlist(id),
code = unlist(code))
Upvotes: 0
Reputation: 4002
Try
data <- xmlParse(file="c:/R/CL/filename.xml")
And something like :
sapply(getNodeSet(data, "//Frame/Object[@type]"), xmlValue)
It should give you a vector of all the type of the node objects under a node Frame. More here: http://www.w3schools.com/xsl/xpath_syntax.asp
Upvotes: 0