CptNemo
CptNemo

Reputation: 6755

Dataframe from sibling XML items

I have few XML items Series structured as

<Series>
  <SeriesKey>
  <Value concept="LOCATION" value="NOR"/>
  <Value concept="TRANSACT" value="B1_GA"/>
  <Value concept="MEASURE" value="CXC"/>
  </SeriesKey>
  <Attributes>
  <Value concept="TIME_FORMAT" value="P1Y"/>
  <Value concept="UNIT" value="USD"/>
  <Value concept="POWERCODE" value="6"/>
  </Attributes>
  <Obs>
  <Time>1970</Time>
  <ObsValue value="12729.77490809"/>
  </Obs>
  <Obs>
  <Time>1971</Time>
  <ObsValue value="14505.3225330031"/>
  </Obs>
  <Obs>
  <Time>1972</Time>
  <ObsValue value="17219.5954919744"/>
  </Obs>
  </Series>
<Series>
  <SeriesKey>
   <Value concept="LOCATION" value="USA"/>
   <Value concept="TRANSACT" value="B1_GA"/>
   <Value concept="MEASURE" value="CXC"/>
  </SeriesKey>
  <Attributes>
   <Value concept="TIME_FORMAT" value="P1Y"/>
   <Value concept="UNIT" value="USD"/>
   <Value concept="POWERCODE" value="6"/>
   </Attributes>
    ...
  </Series>

I want to combine all Series to build a dataframe structured as

data.frame(LOCATION=rep("NOR",3),
           TRANSACT=rep("B1_GA",3),
           TIME=c(1970,1971,1972),
           VALUE=c(12729.77490809, 14505.3225330031, 17219.5954919744))
# LOCATION TRANSACT TIME    VALUE
# 1      NOR    B1_GA 1970 12729.77
# 2      NOR    B1_GA 1971 14505.32
# 3      NOR    B1_GA 1972 17219.60

From the original document stored here

url <- "http://stats.oecd.org/restsdmx/sdmx.ashx/GetData/SNA_TABLE1/NOR+CAN+FRA+DEU+GBR+USA+ITA+JAP.B1_GA+B1G_P119+B1G+B1GVA+B1GVB_E+B1GVC+B1GVF+B1GVG_I+B1GVJ+B1GVK+B1GVL+B1GVM_N+B1GVO_Q+B1GVR_U+D21_D31+D21S1+D31S1+DB1_GA.CXC/all?startTime=1950&endTime=2013"

I can access single items with

xml <- xmlParse(url)

ns<-xmlNamespaceDefinitions(xml, simplify=T)
names(ns)[1] <- "def"   #assign name "def"

and

length(xpathSApply(xml, "//message:MessageGroup/def:DataSet/def:Series/def:SeriesKey/def:Value[@concept='LOCATION']/@value", namespaces=ns))
# [1] 123

length(xpathSApply(xml, "//message:MessageGroup/def:DataSet/def:Series/def:Obs", namespaces=ns))
# [1] 3959

but I can't see any easy way to assign each Obs to its SeriesKey.

Upvotes: 2

Views: 79

Answers (2)

Martin Morgan
Martin Morgan

Reputation: 46866

Not really much better, but... grab the data and namespaces

xml <- xmlParse(url)
ns <- xmlNamespaceDefinitions(xml, simplify=TRUE)
names(ns)[1] = "def"

Create an R list of Series nodes, using "//" on the assumption that the XML has consistent structure so full path specification is not required.

series <- getNodeSet(xml, "//def:Series", namespaces=ns)

For the single-element components, iterate over each node and use the xpathSApply function to extract the value; the query ".//" means relative to the current Series.

q <- ".//def:SeriesKey/def:Value[@concept='LOCATION']/@value"
LOCATION <- sapply(series, xpathSApply, q, namespaces=ns)

Maybe a less cryptic rendition of the above is

LOCATION <- sapply(series, function(elt) xpathSApply(elt, q, namespaces=ns))

For the nested components, do similar

q <- ".//def:Obs/def:Time"
TIME <- lapply(series, xpathSApply, q, xmlValue, namespaces=ns)

q <- ".//def:Obs/def:ObsValue/@value"
VALUE <- lapply(series, xpathSApply, q, namespaces=ns)

Figure out the geometry of the nesting, and use that to replicate the single elements as you construct the data frame; avoid using names both for efficiency and to avoid 'conveniences' in the data.frame() function.

geom <- sapply(TIME, length)
df <- data.frame(LOCATION=rep(unname(LOCATION), geom),
                 TIME=unlist(TIME, use.names=FALSE),
                 VALUE=unlist(VALUE, use.names=FALSE))

More or less at all costs you want to avoid rbind()ing data frames together as in @CptNemo's example -- this results in excessive copying and abysmal performance.

A more vectorized approach, placing further faith in the overall geometry of the file, is to extract elements from the XML rather than the node set (once the geometry is determined)

series <- getNodeSet(xml, "//def:Series", namespaces=ns)
geom <- sapply(series, xpathSApply, "count(.//def:Obs)", namespaces=ns)

LOCATION <- xpathSApply(xml, "//def:Value[@concept='LOCATION']/@value",
                        namespaces=ns)
TIME <- xpathSApply(xml, "//def:Time/text()", xmlValue, namespaces=ns)

etc.

Upvotes: 0

CptNemo
CptNemo

Reputation: 6755

Far from elegant but still...

df <- data.frame()

for (i in 1:length(xpathSApply(xml, "//message:MessageGroup/def:DataSet/def:Series", namespaces=ns))) {
  location <- xpathSApply(xml, paste0("//message:MessageGroup/def:DataSet/def:Series[",i,"]/def:SeriesKey/def:Value[@concept='LOCATION']/@value"), namespaces=ns)
  transact <- xpathSApply(xml, paste0("//message:MessageGroup/def:DataSet/def:Series[",i,"]/def:SeriesKey/def:Value[@concept='TRANSACT']/@value"), namespaces=ns)
  measure <- xpathSApply(xml, paste0("//message:MessageGroup/def:DataSet/def:Series[",i,"]/def:SeriesKey/def:Value[@concept='MEASURE']/@value"), namespaces=ns)

  time <- xpathSApply(xml, paste0("//message:MessageGroup/def:DataSet/def:Series[",i,"]/def:Obs/def:Time"), xmlValue, 
                      namespaces=ns)
  value <- xpathSApply(xml, paste0("//message:MessageGroup/def:DataSet/def:Series[",i,"]/def:Obs/def:ObsValue/@value"), 
                       namespaces=ns)

  tmp <- data.frame(location=rep(location, length(time)),
                    transact=rep(transact, length(time)),
                    measure=rep(measure, length(time)),
                    time=time,
                    value=value)

  df <- rbind(df, tmp)
}

Upvotes: 1

Related Questions