Martijn
Martijn

Reputation: 129

R - How to convert XML to dataframe in R with the correct structure?

I would like to convert an XML file into a dataframe. I have found some functions which allow me to read the XML data, however I am not able to get a dataframe with the same structure as the initial XML file (= structure that you would get if you open the XML file in Excel).

This is my original XML code:

<Data>
<Frame timestamp='17/09/2014  20:55:00.902' timecode='75299902' >
<Object type='Taxi' DISTANCE='3037' VOLUME='1668' id='15593' code='0' />
<Object type='Taxi' DISTANCE='3605' VOLUME='931' id='15603' code='4' />
<Object type='Bus' DISTANCE='3563' VOLUME='488' id='15604' code='9' />
<Object type='Taxi' DISTANCE='4942' VOLUME='57' id='15624' code='1' />
<Object type='Taxi' DISTANCE='784' VOLUME='47' id='15625' code='10' />
<Object type='Taxi' DISTANCE='3301' VOLUME='2041' id='15626' code='42' />
<Object type='Bus' DISTANCE='2040' VOLUME='2945' id='15630' code='27' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
<TrackingFrame timestamp='17/09/2014 20:54:59.771' timecode='75299771' >
<Object type='Taxi' DISTANCE='4941' VOLUME='51' id='15624' code='1' />
<Object type='Taxi' DISTANCE='789' VOLUME='47' id='15625' code='10' />
<Object type='Taxi' DISTANCE='3300' VOLUME='2069' id='15626' code='42' />
<Object type='Bus' DISTANCE='2027' VOLUME='2947' id='15630' code='27' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
</Data>

This allows me already to get a list of the data: library(XML)

# Convert xml data to R
data <- xmlTreeParse(file="c:/R/CL/filename.xml",useInternalNode=TRUE)
# Create a list of the data
xl<-xmlToList(data)

Ideally I would like to get a dataframe based on this XML data that looks the same as when you would input the XML data in Excel. However, when I look at the output of xl then I see that this is organized in Objects and Times. Normally when I open the XML files in Excel this information is linked (and every Object has also columns with Time information)

This is the output of xl<-xmlToList(data):

$Frame$Object
     type         DISTANCE         VOLUME        id       code 
"Taxi"    "3037"    "1668"   "15593"       "0" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "3605"   "931" "15603"     "4" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 “Bus”  "3563"   "488" "15604"     "9" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "2161"  "1592" "15615"    "21" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "4942"    "57" "15624"     "1" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"   "784"    "47" "15625"    "10" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "3301"  "2041" "15626"    "42" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 “Bus”  "2040"  "2945" "15630"    "27" 


$Frame$Object
  type      DISTANCE      VOLUME      Z 
"Airplane" "2865" "2722"    "0" 

$Frame$Time
                timestamp                  timecode 
"17/09/2014 20:54:59.902"                "75299902"

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "4941"    "51" "15624"     "1" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"   "789"    "47" "15625"    "10" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "3300"  "2069" "15626"    "42" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 “Bus”  "2027"  "2947" "15630"    "27" 

$Frame$Object
  type      DISTANCE      VOLUME      Z 
"Airplane" "2865" "2722"    "0" 

$Frame$Time
                timestamp                  timecode 
"17/09/2014 20:54:59.771"                "75299771"

This lists contains 2 table structures / frames: Frame$Object and Frame$Time. I would like to combine these 2 structures into one combined table (by repeating the columns timestamp and timecode with the time information for every Object).

See here below the desired output (with the same structure as when you would enter the XML file in Excel):

type    DISTANCE    VOLUME  id  code    z   timestamp   timecode
Taxi    3037    1668    15593   0       17/09/2014 20:54:59.902 75299902
Taxi    3605    931 15603   4       17/09/2014 20:54:59.902 75299902
Bus 3563    488 15604   9       17/09/2014 20:54:59.900 75299902
Taxi    4942    57  15624   1       17/09/2014 20:54:59.900 75299902
Taxi    784 47  15625   10      17/09/2014 20:54:59.900 75299902
Taxi    3301    2041    15626   42      17/09/2014 20:54:59.900 75299902
Bus 2040    2945    15630   27      17/09/2014 20:54:59.900 75299902
Airplane    2865    2722            0   17/09/2014 20:54:59.900 75299902
Taxi    4941    51  15624   1        17/09/2014 20:54:59.771    75299771
Taxi    789 47  15625   10       17/09/2014 20:54:59.771    75299771
Taxi    3300    2069    15626   42       17/09/2014 20:54:59.771    75299771
Bus 2027    2947    15630   27       17/09/2014 20:54:59.771    75299771
Airplane    2865    2722            0    17/09/2014 20:54:59.771    75299771

Which functions would work to achieve this result? Thank you on beforehand for your help!

Upvotes: 1

Views: 7679

Answers (3)

hrbrmstr
hrbrmstr

Reputation: 78792

You can use xml2 and dplyr for a quick conversion:

library(xml2)
library(dplyr)

dat <- "<Data>
<Frame timestamp='17/09/2014  20:55:00.902' timecode='75299902' >
<Object type='Taxi' DISTANCE='3037' VOLUME='1668' id='15593' code='0' />
<Object type='Taxi' DISTANCE='3605' VOLUME='931' id='15603' code='4' />
<Object type='Bus' DISTANCE='3563' VOLUME='488' id='15604' code='9' />
<Object type='Taxi' DISTANCE='4942' VOLUME='57' id='15624' code='1' />
<Object type='Taxi' DISTANCE='784' VOLUME='47' id='15625' code='10' />
<Object type='Taxi' DISTANCE='3301' VOLUME='2041' id='15626' code='42' />
<Object type='Bus' DISTANCE='2040' VOLUME='2945' id='15630' code='27' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
<Frame timestamp='17/09/2014 20:54:59.771' timecode='75299771' >
<Object type='Taxi' DISTANCE='4941' VOLUME='51' id='15624' code='1' />
<Object type='Taxi' DISTANCE='789' VOLUME='47' id='15625' code='10' />
<Object type='Taxi' DISTANCE='3300' VOLUME='2069' id='15626' code='42' />
<Object type='Bus' DISTANCE='2027' VOLUME='2947' id='15630' code='27' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
</Data>"

doc <- read_xml(dat)

# bind the data.frames built in the iterator together
bind_rows(lapply(xml_find_all(doc, "//Frame"), function(x) {

  # extract the attributes from the parent tag as a data.frame
  parent <- data.frame(as.list(xml_attrs(x)), stringsAsFactors=FALSE)

  # make a data.frame out of the attributes of the kids
  kids <- bind_rows(lapply(xml_children(x), function(x) as.list(xml_attrs(x))))

  # combine them
  cbind.data.frame(parent, kids, stringsAsFactors=FALSE)

}))

## Source: local data frame [13 x 8]
## 
##                   timestamp timecode     type DISTANCE VOLUME    id  code     Z
##                       (chr)    (chr)    (chr)    (chr)  (chr) (chr) (chr) (chr)
## 1  17/09/2014  20:55:00.902 75299902     Taxi     3037   1668 15593     0    NA
## 2  17/09/2014  20:55:00.902 75299902     Taxi     3605    931 15603     4    NA
## 3  17/09/2014  20:55:00.902 75299902      Bus     3563    488 15604     9    NA
## 4  17/09/2014  20:55:00.902 75299902     Taxi     4942     57 15624     1    NA
## 5  17/09/2014  20:55:00.902 75299902     Taxi      784     47 15625    10    NA
## 6  17/09/2014  20:55:00.902 75299902     Taxi     3301   2041 15626    42    NA
## 7  17/09/2014  20:55:00.902 75299902      Bus     2040   2945 15630    27    NA
## 8  17/09/2014  20:55:00.902 75299902 Airplane     2865   2722    NA    NA     0
## 9   17/09/2014 20:54:59.771 75299771     Taxi     4941     51 15624     1    NA
## 10  17/09/2014 20:54:59.771 75299771     Taxi      789     47 15625    10    NA
## 11  17/09/2014 20:54:59.771 75299771     Taxi     3300   2069 15626    42    NA
## 12  17/09/2014 20:54:59.771 75299771      Bus     2027   2947 15630    27    NA
## 13  17/09/2014 20:54:59.771 75299771 Airplane     2865   2722    NA    NA     0

You'll need to convert the types as necessary.

You can do something similar if you're stuck with the XML package:

doc <- xmlParse(dat)

bind_rows(xpathApply(doc, "//Frame", function(x) {
  parent <- data.frame(as.list(xmlAttrs(x)), stringsAsFactors=FALSE)
  kids <- bind_rows(lapply(xmlChildren(x), function(x) as.list(xmlAttrs(x))))
  cbind.data.frame(parent, kids, stringsAsFactors=FALSE)
}))

Upvotes: 7

Parfait
Parfait

Reputation: 107567

Consider the XML library's xpathsapply() route with workarounds to retrieve timestamp and timecode for each child and handle the missing attributes for id and code:

library(XML)

doc <- xmlParse("C:/Path/To/XML/File.xml")

# RETRIEVE FRAME ATTRS DATA FOR EACH OBJECT CHILD
timestamp <- c()
timecode <- c()
numberofobjs <- length(xpathSApply(doc, "//Object"))
for (i in (1:numberofobjs)) {
    timestamp <- c(timestamp, xpathSApply(doc, sprintf("//Object[%s]/ancestor::Frame", i), 
                                          xmlGetAttr, "timestamp"))
    timecode <- c(timecode, xpathSApply(doc, sprintf("//Object[%s]/ancestor::Frame", i), 
                                        xmlGetAttr, "timecode"))
}

# XPATH TO EACH ATTRIBUTE
type <- xpathSApply(doc, "//Object", xmlGetAttr, "type")
distance <- xpathSApply(doc, "//Object", xmlGetAttr,"DISTANCE")
volume <- xpathSApply(doc, "//Object", xmlGetAttr, "VOLUME")
id <- xpathSApply(doc, "//Object", xmlGetAttr, "id")
id <- sapply(id, function(x) ifelse(is.null(x), NA, x))     # REMOVE NULLS
code <- xpathSApply(doc, "//Object", xmlGetAttr, "code")
code <- sapply(id, function(x) ifelse(is.null(x), NA, x))   # REMOVE NULLS

# COMBINE LISTS INTO DATA FRAME
xmldf <- data.frame(timecode = unlist(timecode), 
                    timestamp = unlist(timestamp), 
                    type = unlist(type), 
                    distance = unlist(distance), 
                    volume = unlist(volume), 
                    id = unlist(id), 
                    code = unlist(code))

Upvotes: 0

YCR
YCR

Reputation: 4002

Try

data <- xmlParse(file="c:/R/CL/filename.xml")

And something like :

sapply(getNodeSet(data, "//Frame/Object[@type]"), xmlValue)

It should give you a vector of all the type of the node objects under a node Frame. More here: http://www.w3schools.com/xsl/xpath_syntax.asp

Upvotes: 0

Related Questions