Reputation: 23
I have an XML document like the one shown below:
<root>
<Item>
<A>text1</A>
<B>text2</B>
<C>text3</C>
<C>text4</C>
<C>text5</C>
<D>text6</D>
...
</Item>
<Item>
...
</Item>
...
</root>
It's relatively simple, with one complicating factor: each Item can have any number of C elements.
Ultimately, I'd love to have this in a table like:
A B C D
1 text1 text2 <list [3]> text6
I've already created my table for the other variables (in what's probably a messy way, but it works):
vnames<-c("A","B","D")
dat<-list()
for(i in 1:length(vnames)){
dat[[i]]<-xml_text(xml_find_first(nodeset,paste0(".//d1:",vnames[[i]]),xml_ns(xmlfile)))
}
dat<-as.data.frame(dat,col.names=vnames)
But this method only works when xml_find_first actually gives you everything you want. I could use xml_find_all, but this makes the list lengths unequal for merging: I get one long list of C values and I don't know which one goes with which item.
I can certainly loop through each item and xml_find_all the C elements, but that seems inefficient.
Is there an easier way to do this?
Upvotes: 1
Views: 816
Reputation: 24079
Here is a possible solution, though I am not sure the final result is exactly what you are looking for.
This works well if all of the data is only one level down. If the data is nested several levels deep in the XML, the solution needs to be extended. The basic approach is to parse out all of the Item nodes, collect the name and text of every child node of each Item, and then create an item index by counting the number of children in each Item. All of the data is then stored in a three-column data frame: item index, child name and value. From there it is a matter of converting to the desired final format.
library(xml2)
page<-read_xml("<root>
<Item>
<A>text1</A>
<B>text2</B>
<C>text3</C>
<C>text4</C>
<C>text5</C>
<D>text6</D>
</Item>
<Item>
<A>text12</A>
<B>text22</B>
<C>text32</C>
</Item>
</root>")
#find all items and store as a list
items<-xml_find_all(page, ".//Item")
#extract all children's names and values
nodenames<-xml_name(xml_children(items))
contents<-trimws(xml_text(xml_children(items)))
#Need to create an index to associate the nodes/contents with each item
#xml_length() returns the number of child elements of each item
itemindex<-rep(seq_along(items), times=xml_length(items))
#store all information in data frame.
df<-data.frame(itemindex, nodenames, contents)
#Convert from long to wide format
library(tidyr)
pivot_wider(df, id_cols= itemindex, names_from = nodenames,
values_from = contents) # %>% unnest(cols = c(A, B, C, D))
# A tibble: 2 x 5
  itemindex A           B           C           D
      <int> <list<fct>> <list<fct>> <list<fct>> <list<fct>>
1         1 [1]         [1]         [3]         [1]
2         2 [1]         [1]         [1]         [0]
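If the goal is the table from the question (plain columns for A, B and D, and a list column only for C), the following is a rough sketch of one way to get there using xml2 and base R. It assumes, as in the sample above, that A, B and D occur at most once per Item and that the document has no namespace; the names single and wide are only for illustration. With a namespaced document like the one in the question, the XPath strings would need the d1: prefix.

```r
library(xml2)

# Same sample document as above, compacted
page <- read_xml("<root>
  <Item><A>text1</A><B>text2</B><C>text3</C><C>text4</C><C>text5</C><D>text6</D></Item>
  <Item><A>text12</A><B>text22</B><C>text32</C></Item>
</root>")
items <- xml_find_all(page, ".//Item")

# Single-valued fields: on a nodeset, xml_find_first() returns one result
# per item (a missing node gives NA), so every column has length(items) rows
single <- c("A", "B", "D")
wide <- as.data.frame(
  lapply(setNames(single, single),
         function(v) xml_text(xml_find_first(items, paste0("./", v)))),
  stringsAsFactors = FALSE
)

# Repeated field: rebuild the item index, then split the C values by item
children  <- xml_children(items)
itemindex <- rep(seq_along(items), times = xml_length(items))
is_c      <- xml_name(children) == "C"
c_by_item <- split(trimws(xml_text(children[is_c])),
                   factor(itemindex[is_c], levels = seq_along(items)))

# Attach as a list column; items without any C get an empty character vector
wide$C <- unname(c_by_item)
wide
```

Setting the factor levels explicitly keeps one list element per Item even when an Item has no C at all, so the list column always lines up with the rows.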
Upvotes: 1