Reputation: 6755
I have few XML items Series
structured as
<Series>
<SeriesKey>
<Value concept="LOCATION" value="NOR"/>
<Value concept="TRANSACT" value="B1_GA"/>
<Value concept="MEASURE" value="CXC"/>
</SeriesKey>
<Attributes>
<Value concept="TIME_FORMAT" value="P1Y"/>
<Value concept="UNIT" value="USD"/>
<Value concept="POWERCODE" value="6"/>
</Attributes>
<Obs>
<Time>1970</Time>
<ObsValue value="12729.77490809"/>
</Obs>
<Obs>
<Time>1971</Time>
<ObsValue value="14505.3225330031"/>
</Obs>
<Obs>
<Time>1972</Time>
<ObsValue value="17219.5954919744"/>
</Obs>
</Series>
<Series>
<SeriesKey>
<Value concept="LOCATION" value="USA"/>
<Value concept="TRANSACT" value="B1_GA"/>
<Value concept="MEASURE" value="CXC"/>
</SeriesKey>
<Attributes>
<Value concept="TIME_FORMAT" value="P1Y"/>
<Value concept="UNIT" value="USD"/>
<Value concept="POWERCODE" value="6"/>
</Attributes>
...
</Series>
I want to combine all Series
to build a dataframe structured as
data.frame(LOCATION=rep("NOR",3),
TRANSACT=rep("B1_GA",3),
TIME=c(1970,1971,1972),
VALUE=c(12729.77490809, 14505.3225330031, 17219.5954919744))
# LOCATION TRANSACT TIME VALUE
# 1 NOR B1_GA 1970 12729.77
# 2 NOR B1_GA 1971 14505.32
# 3 NOR B1_GA 1972 17219.60
From the original document stored here
url <- "http://stats.oecd.org/restsdmx/sdmx.ashx/GetData/SNA_TABLE1/NOR+CAN+FRA+DEU+GBR+USA+ITA+JAP.B1_GA+B1G_P119+B1G+B1GVA+B1GVB_E+B1GVC+B1GVF+B1GVG_I+B1GVJ+B1GVK+B1GVL+B1GVM_N+B1GVO_Q+B1GVR_U+D21_D31+D21S1+D31S1+DB1_GA.CXC/all?startTime=1950&endTime=2013"
I can access single items with
xml <- xmlParse(url)
ns<-xmlNamespaceDefinitions(xml, simplify=T)
names(ns)[1] <- "def" #assign name "def"
and
length(xpathSApply(xml, "//message:MessageGroup/def:DataSet/def:Series/def:SeriesKey/def:Value[@concept='LOCATION']/@value", namespaces=ns))
# [1] 123
length(xpathSApply(xml, "//message:MessageGroup/def:DataSet/def:Series/def:Obs", namespaces=ns))
# [1] 3959
but I can't see any easy way to assign each Obs
to its SeriesKey
.
Upvotes: 2
Views: 79
Reputation: 46866
Not really much better, but... grab the data and namespaces
xml <- xmlParse(url)
ns <- xmlNamespaceDefinitions(xml, simplify=TRUE)
names(ns)[1] = "def"
Create an R list of Series nodes, using "//" on the assumption that the XML has consistent structure so full path specification is not required.
series <- getNodeSet(xml, "//def:Series", namespaces=ns)
For the single-element components, iterate over each node and use the xpathSApply function to extract the value; the query ".//" means relative to the current Series.
q <- ".//def:SeriesKey/def:Value[@concept='LOCATION']/@value"
LOCATION <- sapply(series, xpathSApply, q, namespaces=ns)
Maybe a less cryptic rendition of the above is
LOCATION <- sapply(series, function(elt) xpathSApply(elt, q, namespaces=ns))
For the nested components, do similar
q <- ".//def:Obs/def:Time"
TIME <- lapply(series, xpathSApply, q, xmlValue, namespaces=ns)
q <- ".//def:Obs/def:ObsValue/@value"
VALUE <- lapply(series, xpathSApply, q, namespaces=ns)
Figure out the geometry of the nesting, and use that to replicate the single elements as you construct the data frame; avoid using names both for efficiency and to avoid 'conveniences' in the data.frame() function.
geom <- sapply(TIME, length)
df <- data.frame(LOCATION=rep(unname(LOCATION), geom),
TIME=unlist(TIME, use.names=FALSE),
VALUE=unlist(VALUE, use.names=FALSE))
More or less at all costs you want to avoid rbind()ing data frames together as in @CptNemo's example -- this results in excessive copying and abysmal performance.
A more vectorized approach, placing further faith in the overall geometry of the file, is to extract elements from the XML rather than the node set (once the geometry is determined)
series <- getNodeSet(xml, "//def:Series", namespaces=ns)
geom <- sapply(series, xpathSApply, "count(.//def:Obs)", namespaces=ns)
LOCATION <- xpathSApply(xml, "//def:Value[@concept='LOCATION']/@value",
namespaces=ns)
TIME <- xpathSApply(xml, "//def:Time/text()", xmlValue, namespaces=ns)
etc.
Upvotes: 0
Reputation: 6755
Far from elegant but still...
df <- data.frame()
for (i in 1:length(xpathSApply(xml, "//message:MessageGroup/def:DataSet/def:Series", namespaces=ns))) {
location <- xpathSApply(xml, paste0("//message:MessageGroup/def:DataSet/def:Series[",i,"]/def:SeriesKey/def:Value[@concept='LOCATION']/@value"), namespaces=ns)
transact <- xpathSApply(xml, paste0("//message:MessageGroup/def:DataSet/def:Series[",i,"]/def:SeriesKey/def:Value[@concept='TRANSACT']/@value"), namespaces=ns)
measure <- xpathSApply(xml, paste0("//message:MessageGroup/def:DataSet/def:Series[",i,"]/def:SeriesKey/def:Value[@concept='MEASURE']/@value"), namespaces=ns)
time <- xpathSApply(xml, paste0("//message:MessageGroup/def:DataSet/def:Series[",i,"]/def:Obs/def:Time"), xmlValue,
namespaces=ns)
value <- xpathSApply(xml, paste0("//message:MessageGroup/def:DataSet/def:Series[",i,"]/def:Obs/def:ObsValue/@value"),
namespaces=ns)
tmp <- data.frame(location=rep(location, length(time)),
transact=rep(transact, length(time)),
measure=rep(measure, length(time)),
time=time,
value=value)
df <- rbind(df, tmp)
}
Upvotes: 1