chintan s
chintan s

Reputation: 6488

R: Parse XML in R

I have a following XML file

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<t:Forecast xmlns:t="http://example.com">
      <Sender Abbreviation="abc" Name="xyz"/>
      <Recipient Abbreviation="efg" Name="cba"/>
      <createdUTC>2017-11-24T10:41:11Z</createdUTC>
      <MessageID>bcjs</MessageID>
      <SystemState>test</SystemState>
      <ForecastData>
          <DataHeader GroupKey="rkolo">

          <Timeseries ID="abc123">
              <TimeInt ISTUTC="2017-11-24T10:45:00Z" Out="858"/>
              <TimeInt ISTUTC="2017-11-24T11:45:00Z" Out="868"/>
          </Timeseries>

          <Timeseries ID="xyz">
              <TimeInt ISTUTC="2017-11-24T10:45:00Z" Out="870"/>
              <TimeInt ISTUTC="2017-11-24T11:45:00Z" Out="890"/>
          </Timeseries>
      </ForecastData>
</t:Forecast>

I would like to have two data frames

TimeInt                 out
2017-11-24T10:45:00Z    858
2017-11-24T11:45:00Z    868

and another data frame as

TimeInt                 out
2017-11-24T10:45:00Z    870
2017-11-24T11:45:00Z    890

So far, I have done as follows:

require(XML)

temp = xmlParse("datafile.xml")
data = xmlToList(temp)

But the output of data contains lot of nested lists. How do I get the data frames?

Edit 1: changed out values

Upvotes: 0

Views: 56

Answers (1)

Parfait
Parfait

Reputation: 107587

Consider the triple-colon method xmlAttrsToDataFrame but looping through each node index of Timeseries, even name each element with the corresponding Timeseries id.

library(XML)

txt='<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
     <t:Forecast xmlns:t="http://example.com">
        <Sender Abbreviation="abc" Name="xyz"/>
        <Recipient Abbreviation="efg" Name="cba"/>
        <createdUTC>2017-11-24T10:41:11Z</createdUTC>
        <MessageID>bcjs</MessageID>
        <SystemState>test</SystemState>
        <ForecastData>
           <DataHeader GroupKey="rkolo"/>
             <Timeseries ID="abc123">
                <TimeInt ISTUTC="2017-11-24T10:45:00Z" Out="858"/>
                <TimeInt ISTUTC="2017-11-24T11:45:00Z" Out="858"/>
             </Timeseries>

             <Timeseries ID="xyz">
                <TimeInt ISTUTC="2017-11-24T10:45:00Z" Out="870"/>
                <TimeInt ISTUTC="2017-11-24T11:45:00Z" Out="870"/>
             </Timeseries>
        </ForecastData>
     </t:Forecast>'

doc <- xmlParse(txt)

dfList <- lapply(1:length(xpathSApply(doc, "//Timeseries", xmlAttrs)), function(i)
    XML:::xmlAttrsToDataFrame(getNodeSet(doc, path=paste0('//Timeseries[',i,']/TimeInt')))
)

dfList <- setNames(dfList, xpathSApply(doc, path='//Timeseries', xmlAttrs))
dfList

Output

dfList$abc123
#                 ISTUTC Out
# 1 2017-11-24T10:45:00Z 858
# 2 2017-11-24T11:45:00Z 858

dfList$xyz
#                 ISTUTC Out
# 3 2017-11-24T10:45:00Z 870
# 4 2017-11-24T11:45:00Z 870

Upvotes: 1

Related Questions