Jet She

Reputation: 33

R: parse a large unstructured XML file

I have a very complicated XML file that I need to parse and present as a data frame in R. Its structure is similar to the following example. The nodes are not all at the same level.

<Root>
  <A>
   <info1>a</info1>
     <child>
       <info2>b</info2>
       <info3>c</info3>
       <info4>d</info4>
     </child>
   <info5>e</info5>
  </A>
  <B>
   <info6>f</info6>
   <info7>g</info7>
  </B>
</Root>

I came up with some code to parse the file:

library(XML)
doc <- xmlParse(file = "sample.xml", useInternalNodes = TRUE)
rootnode <- xmlRoot(doc)
df1 <- xmlToDataFrame(nodes = getNodeSet(rootnode, "//Root/A"))
df2 <- xmlToDataFrame(nodes = getNodeSet(rootnode, "//Root/B"))
# cbind() has no all= argument; passing all=TRUE would add a column named "all"
Final <- cbind(df1, df2)

The result is returned as follows (all the values from the child node are squashed together):

info1 child info5 info6 info7
  a    bcd    e     f     g

However, the ideal result I want is:

info1 info2 info3 info4 info5 info6 info7
  a     b     c     d     e     f     g

Because the XML file contains a large number of nodes like the ones above, manually manipulating the data frame is not practical.
I also tried changing the XPath expression to "//Root/A/child", but then all the values directly under node A and node B are lost. Could anyone offer a solution to this problem? Thanks in advance.

Upvotes: 3

Views: 273

Answers (3)

Martin Morgan

Reputation: 46856

Match the nodes using starts-with()

> doc = xmlParse(xml)
> xpathSApply(doc, "//*[starts-with(name(), 'info')]", xmlValue)
[1] "a" "b" "c" "d" "e" "f" "g"
> xpathSApply(doc, "//*[starts-with(name(), 'info')]", xmlName)
[1] "info1" "info2" "info3" "info4" "info5" "info6" "info7"

so

query <- "//*[starts-with(name(), 'info')]"
setNames(
    xpathSApply(doc, query, xmlValue),
    xpathSApply(doc, query, xmlName)
)
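
Since the question asks for a data frame rather than a named vector, the result can be wrapped one step further. The following is a minimal sketch, assuming the sample document from the question is supplied as a string and that each matched element name occurs only once; the variable names `values` and `Final` are illustrative.

```r
library(XML)

# Sample document from the question, supplied as a string (assumption:
# a real file would be read with xmlParse(file = ...) instead).
xml <- "<Root>
  <A>
    <info1>a</info1>
    <child>
      <info2>b</info2>
      <info3>c</info3>
      <info4>d</info4>
    </child>
    <info5>e</info5>
  </A>
  <B>
    <info6>f</info6>
    <info7>g</info7>
  </B>
</Root>"
doc <- xmlParse(xml, asText = TRUE)

# Select every element whose name starts with "info", at any depth
query <- "//*[starts-with(name(), 'info')]"
values <- setNames(
    xpathSApply(doc, query, xmlValue),
    xpathSApply(doc, query, xmlName)
)

# as.list() turns each named element into its own column,
# so data.frame() builds a single row
Final <- data.frame(as.list(values), stringsAsFactors = FALSE)
print(Final)
```

If the element names in the real file do not share a common prefix, the predicate in `query` would need to be adapted to whatever distinguishes the value-carrying elements.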

Upvotes: 0

MKR

Reputation: 20085

You can use xmlToList and unlist to reduce the XML data to a named vector. The names can then be changed with gsub to match the OP's expectations:

library(XML)
result <- unlist(xmlToList(xmlParse(xml)))
# Keep only the last component of each name (the leaf element)
names(result) <- gsub(".*\\.(\\w+)$","\\1", names(result))
result 
# info1 info2 info3 info4 info5 info6 info7 
# "a"   "b"   "c"   "d"   "e"   "f"   "g"

Data:

xml <- "<Root>
  <A>
  <info1>a</info1>
  <child>
  <info2>b</info2>
  <info3>c</info3>
  <info4>d</info4>
  </child>
  <info5>e</info5>
  </A>
  <B>
  <info6>f</info6>
  <info7>g</info7>
  </B>
  </Root>"
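
One thing to watch with this approach: stripping the path prefix can make two different leaves end up with the same name if the real file reuses element names under different parents. The sketch below, which assumes the same sample data, checks for such collisions before converting the vector to a one-row data frame; falling back to the full dotted path is just one possible way to handle them.

```r
library(XML)

# Same sample data as above, written on fewer lines
xml <- "<Root>
  <A><info1>a</info1>
     <child><info2>b</info2><info3>c</info3><info4>d</info4></child>
     <info5>e</info5></A>
  <B><info6>f</info6><info7>g</info7></B>
</Root>"

# Flatten the nested list; names look like "A.child.info2"
result <- unlist(xmlToList(xmlParse(xml, asText = TRUE)))

# Leaf name = everything after the last dot
short_names <- sub("^.*\\.", "", names(result))

# Only use the short names when they are unique; otherwise keep the full
# dotted paths so that no column is silently overwritten
if (anyDuplicated(short_names) == 0) {
    names(result) <- short_names
}

# t() gives a 1 x n matrix whose column names become the data frame columns
Final <- as.data.frame(t(result), stringsAsFactors = FALSE)
print(Final)
```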

Upvotes: 2

Kim

Reputation: 4298

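For less structured XML, you can flatten the whole tree with xmlToList and unlist, then build a one-row data frame:

library(XML)
# recursive and use.names are arguments of unlist(), not of data.frame()
Final <- data.frame(as.list(unlist(xmlToList(rootnode), recursive = TRUE, use.names = TRUE)))

If you don't like the automatically generated column names (A.info1, A.child.info2, and so on), you can pass use.names = FALSE to unlist and then assign your own with names(Final).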

Upvotes: 0
