eable

Reputation: 23

XML in R: Multiple Children with Same Name without Loops

I have an XML document like the one shown below:

<root>
    <Item>
        <A>text1</A>
        <B>text2</B>
        <C>text3</C>
        <C>text4</C>
        <C>text5</C>
        <D>text6</D>
        ...
    </Item>
    <Item>
        ...
    </Item>
    ...
</root>

It's relatively simple with one complicating factor: Each item can have any number of Cs.

Ultimately, I'd love to have this in a table like:

  A     B     C          D    
1 text1 text2 <list [3]> text6

I've already created my table for the other variables (in what's probably a messy way, but it works):

vnames<-c("A","B","D")
dat<-list()
for(i in seq_along(vnames)){
    dat[[i]]<-xml_text(xml_find_first(nodeset,paste0(".//d1:",vnames[[i]]),xml_ns(xmlfile)))
}
dat<-as.data.frame(dat,col.names=vnames)

But this method only works when xml_find_first actually gives you everything you want. I could use xml_find_all, but this makes the list lengths unequal for merging. I get a long list of Cs and I don't know which one goes with which item.

I can certainly loop through each item and xml_find_all the Cs, but that seems inefficient.
Is there any easier way to do this?

Upvotes: 1

Views: 816

Answers (1)

Dave2e

Reputation: 24079

Here is a possible solution; I am not sure whether the final result is exactly what you are looking for.

This works well if all of the data is only one level down. If data is stored multiple levels down in the XML, the solution needs to be extended. The basic approach is to parse out all of the Item nodes, collect the names and values of the child nodes in each Item node, and then create an item index by counting the number of children in each Item. All of the data is then stored in a three-column data frame: item index, child name, and value. From there it is a matter of converting to the desired final format.

library(xml2)

page<-read_xml("<root>
    <Item>
        <A>text1</A>
        <B>text2</B>
        <C>text3</C>
        <C>text4</C>
        <C>text5</C>
        <D>text6</D>
    </Item>
    <Item>
        <A>text12</A>
        <B>text22</B>
        <C>text32</C>
    </Item>
</root>")

#find all items and store as a list
items<-xml_find_all(page, ".//Item")

#extract all children's names and values 
nodenames<-xml_name(xml_children(items))
contents<-trimws(xml_text(xml_children(items)))

#Need to create an index to associate the nodes/contents with each item:
#repeat each item's position once per child (xml_length() counts element children)
itemindex<-rep(seq_along(items), times=xml_length(items))

#store all information in data frame.
df<-data.frame(itemindex, nodenames, contents)

#Convert from long to wide format
library(tidyr)
pivot_wider(df, id_cols= itemindex, names_from = nodenames,
            values_from = contents)  # %>% unnest(cols = c(A, B, C, D))

# A tibble: 2 x 5
itemindex       A           B           C           D
<int> <list<fct>> <list<fct>> <list<fct>> <list<fct>>
    1         [1]         [1]         [3]         [1]
    2         [1]         [1]         [1]         [0]
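
If the goal is the exact layout from the question (plain character columns for A, B and D, and a list column only for C), the same counting idea can be applied to the C nodes alone. The following is only a sketch under a few assumptions: the sample XML has no namespace (the original file apparently needs the d1: prefix), A, B and D occur at most once per Item, and the names c_nodes, c_counts and c_index are made up for illustration. Note that xml_find_num() still visits each Item internally, so this avoids an explicit loop rather than all iteration.

```r
library(xml2)

page <- read_xml("<root>
    <Item>
        <A>text1</A>
        <B>text2</B>
        <C>text3</C>
        <C>text4</C>
        <C>text5</C>
        <D>text6</D>
    </Item>
    <Item>
        <A>text12</A>
        <B>text22</B>
        <C>text32</C>
    </Item>
</root>")

items <- xml_find_all(page, ".//Item")

# Single-valued fields: xml_find_first() returns one result per item,
# and xml_text() gives NA where the child is missing
vnames <- c("A", "B", "D")
dat <- lapply(vnames, function(v) xml_text(xml_find_first(items, paste0("./", v))))
names(dat) <- vnames
dat <- as.data.frame(dat, stringsAsFactors = FALSE)

# Repeated field: fetch every C in one call, then count the Cs per item
c_nodes  <- xml_find_all(items, "./C")
c_counts <- xml_find_num(items, "count(./C)")

# Index that maps each C back to its item; keeping all levels means an
# item without any C still gets an (empty) entry
c_index <- factor(rep(seq_along(items), times = c_counts),
                  levels = seq_along(items))

# Split the flat vector of C values into one character vector per item
dat$C <- unname(split(xml_text(c_nodes), c_index))

str(dat)
```

Each element of dat$C is a character vector with that item's C values, so the first row holds three values and the second row holds one.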

Upvotes: 1
