For each node, calculate number of child nodes using XML package in R

Question

I am fairly new to XML parsing and am trying to parse some NBA SportVU basketball data. I have an XML file that looks like this (in summarized form):

I have created a data frame with the team-id as one column and the number of points as another:

library(XML)
data = xmlParse("myXMLfile.XML")

my_df <- data.frame(
  team.id = sapply(data["//quarter/possession/@team-id"], as, "integer"),
  points = sapply(data["//quarter/possession/@points"], as, "integer")
)

my_df
   team.id  points 
1       30       3
2       23       1
3       30       2
4       23       3
5       30       2 
6       30       1
7       30       1
8       30       1
9       23       2
10      23       2

I would like to add a third column, quarter, so that the data frame looks like this:

my_new_df
   team.id  points  quarter
1       30       3        1
2       23       1        1      
3       30       2        1
4       23       3        2
5       30       2        2
6       30       1        3
7       30       1        3
8       30       1        3
9       23       2        3
10      23       2        4

I figure the easiest way to do this is to grab the unique quarter numbers as a vector, and then repeat each one by the number of child nodes under its quarter node. Does anybody know how I can achieve this? I am also open to different approaches that do not involve the XML package (for example, an xml2 solution).

Thanks!

Rich Scriven · Accepted Answer

This should work, starting from the original document, data (which I call doc). First, a small helper function that combines each node's attributes with its parent quarter's attribute in a one-row data frame.

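helper <- function(x) {
    # Combine the node's own attributes with the attribute of its parent node,
    # then turn the named character vector into a one-row data frame.
    as.data.frame.list(c(xmlAttrs(x), quarter = unname(xmlAttrs(xmlParent(x)))))
}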

Now we can run our helper function across the nodes with lapply() and bind the resulting list of one-row data frames into a single data frame with do.call(rbind, ...).

do.call(rbind, lapply(doc["//quarter/*"], helper))
#    team.id points quarter
# 1       30      3       1
# 2       23      1       1
# 3       30      2       1
# 4       23      3       2
# 5       30      2       2
# 6       30      1       3
# 7       30      1       3
# 8       30      1       3
# 9       23      2       3
# 10      23      2       4
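
The approach proposed in the question (repeat each quarter number by its number of child nodes) can also be sketched directly with the XML package. This is only a sketch under assumptions: the sample document, the `game` root element, and the attribute name `number` on each quarter node are hypothetical, chosen to match the output shown above.

```r
library(XML)

# Hypothetical sample document; the <game> root and the `number`
# attribute name on <quarter> are assumptions, not taken from the real file.
doc <- xmlParse('<game>
  <quarter number="1">
    <possession team-id="30" points="3"/>
    <possession team-id="23" points="1"/>
    <possession team-id="30" points="2"/>
  </quarter>
  <quarter number="2">
    <possession team-id="23" points="3"/>
    <possession team-id="30" points="2"/>
  </quarter>
  <quarter number="3">
    <possession team-id="30" points="1"/>
    <possession team-id="30" points="1"/>
    <possession team-id="30" points="1"/>
    <possession team-id="23" points="2"/>
  </quarter>
  <quarter number="4">
    <possession team-id="23" points="2"/>
  </quarter>
</game>', asText = TRUE)

quarters <- getNodeSet(doc, "//quarter")

# Number of <possession> children under each <quarter>, in document order.
n_children <- sapply(quarters, function(q) length(getNodeSet(q, "./possession")))

# Quarter label taken from the (assumed) `number` attribute.
quarter_id <- as.integer(sapply(quarters, xmlGetAttr, "number"))

my_df <- data.frame(
  team.id = as.integer(xpathSApply(doc, "//quarter/possession", xmlGetAttr, "team-id")),
  points  = as.integer(xpathSApply(doc, "//quarter/possession", xmlGetAttr, "points"))
)

# Repeat each quarter label once per possession it contains.
my_df$quarter <- rep(quarter_id, times = n_children)
my_df
```

Unlike the helper-based version, this keeps all three columns as integers, at the cost of relying on the possessions and quarters being returned in the same document order.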

Data:

library(XML)
doc <- htmlParse('
  
  
  


  
  


  
  
  
  


  
')
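
Since the question also asks about xml2, here is a minimal sketch of the same count-and-repeat idea with that package. The sample document is hypothetical, as are the `game` root element and the `number` attribute on each quarter node; adjust the names to the real file.

```r
library(xml2)

# Hypothetical sample document; the <game> root and the `number`
# attribute name on <quarter> are assumptions, not taken from the real file.
doc <- read_xml('<game>
  <quarter number="1">
    <possession team-id="30" points="3"/>
    <possession team-id="23" points="1"/>
    <possession team-id="30" points="2"/>
  </quarter>
  <quarter number="2">
    <possession team-id="23" points="3"/>
    <possession team-id="30" points="2"/>
  </quarter>
  <quarter number="3">
    <possession team-id="30" points="1"/>
    <possession team-id="30" points="1"/>
    <possession team-id="30" points="1"/>
    <possession team-id="23" points="2"/>
  </quarter>
  <quarter number="4">
    <possession team-id="23" points="2"/>
  </quarter>
</game>')

quarters    <- xml_find_all(doc, "//quarter")
possessions <- xml_find_all(doc, "//quarter/possession")

# xml_length() counts the element children of each node in the node set.
n_children <- xml_length(quarters)

my_df <- data.frame(
  team.id = as.integer(xml_attr(possessions, "team-id")),
  points  = as.integer(xml_attr(possessions, "points")),
  # Repeat each quarter label once per child node it contains.
  quarter = rep(as.integer(xml_attr(quarters, "number")), times = n_children)
)
my_df
```

This assumes every child of a quarter node is a possession. If quarters can contain other elements, counting with an XPath restricted to possession nodes would be the safer choice.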
