XML in R: Multiple Children with Same Name without Loops

Question

I have an XML document like the one shown below:

<Items>
    <Item>
        <A>text1</A>
        <B>text2</B>
        <C>text3</C>
        <C>text4</C>
        <C>text5</C>
        <D>text6</D>
        ...
    </Item>
    <Item>
        ...
    </Item>
    ...
</Items>

It's relatively simple with one complicating factor: Each item can have any number of Cs.

Ultimately, I'd love to have this in a table like:

  A     B     C          D
1 text1 text2 <list [3]> text6

I've already created my table for the other variables (in what's probably a messy way, but it works):

vnames<-c("A","B","D")
dat<-list()
for(i in 1:length(vnames)){
    dat[[i]]<-xml_text(xml_find_first(nodeset,paste0(".//d1:",vnames[[i]]),xml_ns(xmlfile)))
}
dat<-as.data.frame(dat,col.names=vnames)

But this method only works when xml_find_first actually gives you everything you want. I could use xml_find_all, but this makes the list lengths unequal for merging. I get a long list of Cs and I don't know which one goes with which item.

I can certainly loop through each item and xml_find_all the Cs, but that seems inefficient.
Is there any easier way to do this?

Dave2e · Accepted Answer

Here is a possible solution, though I am not sure the final result is exactly what you are looking for.

This works well if all of the data is only one level down. If data is nested more deeply in the XML, the solution needs to be extended. The basic approach is to parse out all of the Item nodes, collect the names and values of all the child nodes in each Item node, and then create an item index by counting the number of children in each item. All of the data is then stored in a 3-column data frame: item index, child name and value. From here it is a matter of converting to the desired final format.
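
As a small illustration of the index step on its own, here is a sketch that assumes the first item has six children and the second has three. The variable child_counts is hypothetical and only stands in for the counts that would come from the parsed document:

```r
# Hypothetical child counts: item 1 has six children, item 2 has three
child_counts <- c(6, 3)

# Repeat each item's position once per child, so every child
# carries the index of the item it belongs to
itemindex <- rep(seq_along(child_counts), times = child_counts)
itemindex
# [1] 1 1 1 1 1 1 2 2 2
```

Because the children are collected item by item, this vector lines up one-to-one with the flattened list of child names and values.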

library(xml2)

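page<-read_xml("<Items>
    <Item>
        <A>text1</A>
        <B>text2</B>
        <C>text3</C>
        <C>text4</C>
        <C>text5</C>
        <D>text6</D>
    </Item>
    <Item>
        <A>text12</A>
        <B>text22</B>
        <C>text32</C>
    </Item>
</Items>")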

#find all items and store as a list
items<-xml_find_all(page, ".//Item")

#extract all children's names and values 
nodenames<-xml_name(xml_children(items))
contents<-trimws(xml_text(xml_children(items)))

#Create an index associating each child node with its parent item
itemindex<-rep(seq_along(items), times=xml_length(items))

#store all information in data frame.
df<-data.frame(itemindex, nodenames, contents)

#Convert from long to wide format
library(tidyr)
pivot_wider(df, id_cols= itemindex, names_from = nodenames,
            values_from = contents)  # %>% unnest(cols = c(A, B, C, D))

# A tibble: 2 x 5
  itemindex           A           B           C           D
      <int> <list<chr>> <list<chr>> <list<chr>> <list<chr>>
          1         [1]         [1]         [3]         [1]
          2         [1]         [1]         [1]         [0]
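
If a single string per cell is preferable to list-columns, one option is to collapse the repeated children while reshaping. The following is a base R sketch, not part of the original answer. It assumes the long data frame has the same three columns as above (rebuilt by hand here so the block runs on its own), and the comma separator is chosen purely for illustration:

```r
# Long-format data matching the two-item example, written out by hand
df <- data.frame(
  itemindex = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  nodenames = c("A", "B", "C", "C", "C", "D", "A", "B", "C"),
  contents  = c("text1", "text2", "text3", "text4", "text5", "text6",
                "text12", "text22", "text32"),
  stringsAsFactors = FALSE
)

# Cross-tabulate item by child name, pasting repeated children together;
# combinations that never occur (item 2 has no D) are left as NA
wide <- tapply(df$contents, list(df$itemindex, df$nodenames),
               paste, collapse = ", ")
wide <- as.data.frame(wide, stringsAsFactors = FALSE)
wide
```

Each item becomes one row, and the three Cs of the first item end up in a single cell as "text3, text4, text5".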

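The question's own code queries with a d1: prefix, which suggests the real document declares a default namespace. In that case a plain ".//Item" XPath would probably match nothing. Below is a minimal sketch of the same approach with the namespace passed explicitly. The namespace URI and the shortened document are made up for illustration:

```r
library(xml2)

# Hypothetical document with a default namespace (the URI is a placeholder)
doc <- read_xml('<Items xmlns="http://example.com/ns">
    <Item><A>text1</A><C>text3</C><C>text4</C></Item>
    <Item><A>text12</A></Item>
</Items>')

# xml2 exposes the default namespace under the prefix d1
ns <- xml_ns(doc)

# Qualify the element name in the XPath and pass the namespace map
items <- xml_find_all(doc, ".//d1:Item", ns)
kids <- xml_children(items)

# One row per child, tagged with the index of its parent item
long <- data.frame(
  itemindex = rep(seq_along(items), times = xml_length(items)),
  nodenames = xml_name(kids),
  contents  = trimws(xml_text(kids)),
  stringsAsFactors = FALSE
)
long
```

Another route, if the namespace carries no useful information, may be to call xml_ns_strip() on the document first so that unprefixed XPath expressions work.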