Canovice

Reputation: 10441

For each node, calculate number of child nodes using XML package in R

I am fairly new to XML parsing and am trying to parse some NBA SportVU basketball data. My XML file looks like this (summarized):

<quarter number="1">
  <possession team-id="30" points="3"/>
  <possession team-id="23" points="1"/>
  <possession team-id="30" points="2"/>
</quarter>
<quarter number="2">
  <possession team-id="23" points="3"/>
  <possession team-id="30" points="2"/>
</quarter>
<quarter number="3">
  <possession team-id="30" points="1"/>
  <possession team-id="30" points="1"/>
  <possession team-id="30" points="1"/>
  <possession team-id="23" points="2"/>
</quarter>
<quarter number="4">
  <possession team-id="23" points="2"/>
</quarter>

I have created a data frame with team-id as one column and the number of points as another:

library(XML)
data = xmlParse("myXMLfile.XML")

my_df <- data.frame(
  team.id = sapply(data["//quarter/possession/@team-id"], as, "integer"),
  points = sapply(data["//quarter/possession/@points"], as, "integer")
)

my_df
   team.id  points 
1       30       3
2       23       1
3       30       2
4       23       3
5       30       2 
6       30       1
7       30       1
8       30       1
9       23       2
10      23       2

I would like to add a third column, quarter, so that the data frame looks like this:

my_new_df
   team.id  points  quarter
1       30       3        1
2       23       1        1      
3       30       2        1
4       23       3        2
5       30       2        2
6       30       1        3
7       30       1        3
8       30       1        3
9       23       2        3
10      23       2        4

I figure the easiest way to do this is to grab the quarter numbers as a vector and then repeat each one by the number of child nodes under the corresponding quarter node. Does anybody know how I can achieve this? I am also open to different approaches that do not involve the XML package (for example, an xml2 solution).

Thanks!

Upvotes: 0

Views: 645

Answers (2)

Rich Scriven

Reputation: 99351

Looks like this would work, beginning with the original parsed document (data in the question, which I call doc here). First, a little helper function to get the desired information into the desired form.

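# Build a one-row data frame from a possession node's attributes plus the
# attribute of its parent quarter (this assumes the quarter node has only
# the one attribute, number).
helper <- function(x) {
    as.data.frame.list(c(xmlAttrs(x), quarter = unname(xmlAttrs(xmlParent(x)))))
}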

Now we can run our helper function across the possession nodes with lapply() and bind the resulting list of one-row data frames into a single data frame with do.call(rbind, ...).

do.call(rbind, lapply(doc["//quarter/*"], helper))
#    team.id points quarter
# 1       30      3       1
# 2       23      1       1
# 3       30      2       1
# 4       23      3       2
# 5       30      2       2
# 6       30      1       3
# 7       30      1       3
# 8       30      1       3
# 9       23      2       3
# 10      23      2       4

Data:

library(XML)
doc <- htmlParse('<quarter number="1">
  <possession team-id="30" points="3"/>
  <possession team-id="23" points="1"/>
  <possession team-id="30" points="2"/>
</quarter>
<quarter number="2">
  <possession team-id="23" points="3"/>
  <possession team-id="30" points="2"/>
</quarter>
<quarter number="3">
  <possession team-id="30" points="1"/>
  <possession team-id="30" points="1"/>
  <possession team-id="30" points="1"/>
  <possession team-id="23" points="2"/>
</quarter>
<quarter number="4">
  <possession team-id="23" points="2"/>
</quarter>')
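
Since the question also allows xml2, here is a minimal sketch of the "repeat each quarter number by its number of children" idea with that package. It is only an illustration, not part of the answer above: the <game> root element and the names xml_string, quarters, poss, and my_new_df are assumptions made for the example, and it relies on xml_length() returning the number of child elements of each node.

```r
library(xml2)

# Sample data from the question; the <game> root is an assumption, added
# because a well-formed XML document needs a single root element.
xml_string <- '<game>
<quarter number="1">
  <possession team-id="30" points="3"/>
  <possession team-id="23" points="1"/>
  <possession team-id="30" points="2"/>
</quarter>
<quarter number="2">
  <possession team-id="23" points="3"/>
  <possession team-id="30" points="2"/>
</quarter>
<quarter number="3">
  <possession team-id="30" points="1"/>
  <possession team-id="30" points="1"/>
  <possession team-id="30" points="1"/>
  <possession team-id="23" points="2"/>
</quarter>
<quarter number="4">
  <possession team-id="23" points="2"/>
</quarter>
</game>'

doc <- read_xml(xml_string)
quarters <- xml_find_all(doc, "//quarter")
poss <- xml_find_all(doc, "//quarter/possession")

my_new_df <- data.frame(
  team.id = as.integer(xml_attr(poss, "team-id")),
  points  = as.integer(xml_attr(poss, "points")),
  # Repeat each quarter number once per child element of that quarter.
  quarter = rep(as.integer(xml_attr(quarters, "number")),
                times = xml_length(quarters))
)
my_new_df
```

Unlike the helper() approach, the columns come out as integers directly because each attribute is converted explicitly.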

Upvotes: 1

Canovice

Reputation: 10441

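Something like this seems to work, although I don't think it is the greatest solution. It uses the XML::xmlChildren function to count the child nodes of each quarter:

# possessions is presumably the parsed XML document (data in the question)
zed = possessions["//quarter"]
unlist(lapply(zed, FUN = function(x) length(XML::xmlChildren(x))))
# [1] 3 2 4 1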
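
To finish the job, the counts still have to be paired with the quarter numbers and expanded with rep(). A rough sketch with the XML package follows; the <game> root element and the names xml_string, n_poss, quarter_no, and quarter are assumptions made for the example, and it counts possession children with a relative XPath rather than xmlChildren() so that stray text nodes cannot inflate the counts.

```r
library(XML)

# Sample data from the question; the <game> root is an assumption, added
# because a well-formed XML document needs a single root element.
xml_string <- '<game>
<quarter number="1">
  <possession team-id="30" points="3"/>
  <possession team-id="23" points="1"/>
  <possession team-id="30" points="2"/>
</quarter>
<quarter number="2">
  <possession team-id="23" points="3"/>
  <possession team-id="30" points="2"/>
</quarter>
<quarter number="3">
  <possession team-id="30" points="1"/>
  <possession team-id="30" points="1"/>
  <possession team-id="30" points="1"/>
  <possession team-id="23" points="2"/>
</quarter>
<quarter number="4">
  <possession team-id="23" points="2"/>
</quarter>
</game>'

doc <- xmlParse(xml_string, asText = TRUE)
quarters <- getNodeSet(doc, "//quarter")

# Number of <possession> children under each <quarter>.
n_poss <- sapply(quarters, function(q) length(getNodeSet(q, "./possession")))

# Quarter numbers, in document order.
quarter_no <- as.integer(sapply(quarters, xmlGetAttr, "number"))

# One quarter number per possession, ready to use as the third column.
quarter <- rep(quarter_no, times = n_poss)
quarter
```

The resulting vector can then be attached to the existing data frame, for example with my_df$quarter <- quarter, since both follow document order.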

Upvotes: 0
