Reputation: 10441
I am fairly new to XML parsing and am trying to parse some NBA SportVU basketball data. I have an XML file that looks like this (in a summarized format):
<quarter number="1">
<possession team-id="30" points="3"/>
<possession team-id="23" points="1"/>
<possession team-id="30" points="2"/>
</quarter>
<quarter number="2">
<possession team-id="23" points="3"/>
<possession team-id="30" points="2"/>
</quarter>
<quarter number="3">
<possession team-id="30" points="1"/>
<possession team-id="30" points="1"/>
<possession team-id="30" points="1"/>
<possession team-id="23" points="2"/>
</quarter>
<quarter number="4">
<possession team-id="23" points="2"/>
</quarter>
I have created a dataframe that has the team-id as 1 column, and the number of points as another column, as such:
library(XML)
data = xmlParse("myXMLfile.XML")
my_df <- data.frame(
team.id = sapply(data["//quarter/possession/@team-id"], as, "integer"),
points = sapply(data["//quarter/possession/@points"], as, "integer")
)
my_df
team.id points
1 30 3
2 23 1
3 30 2
4 23 3
5 30 2
6 30 1
7 30 1
8 30 1
9 23 2
10 23 2
I would like to add a third column, labeled quarter, so that the dataframe looks like this:
my_new_df
team.id points quarter
1 30 3 1
2 23 1 1
3 30 2 1
4 23 3 2
5 30 2 2
6 30 1 3
7 30 1 3
8 30 1 3
9 23 2 3
10 23 2 4
I figure the easiest way to do this is to grab the unique quarter numbers as a vector, and then repeat each one by the number of child nodes below its quarter node. Does anybody know how I can achieve this? I am also open to different approaches that do not involve the XML package (for example, an xml2 solution).
Thanks!
Upvotes: 0
Views: 645
Reputation: 99351
Looks like this would work, beginning with the original document, data (which I call doc). First, a little helper function to get the desired information into the desired form.
helper <- function(x) {
as.data.frame.list(c(xmlAttrs(x), quarter = unname(xmlAttrs(xmlParent(x)))))
}
Now we can run our helper function across the nodes with lapply() and bring the resulting list into a data frame with rbind().
do.call(rbind, lapply(doc["//quarter/*"], helper))
# team.id points quarter
# 1 30 3 1
# 2 23 1 1
# 3 30 2 1
# 4 23 3 2
# 5 30 2 2
# 6 30 1 3
# 7 30 1 3
# 8 30 1 3
# 9 23 2 3
# 10 23 2 4
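One thing to keep in mind: xmlAttrs() returns a character vector, so the columns above hold strings (or factors, depending on the R version) rather than numbers. A minimal sketch of converting them, using a small hand-made stand-in data frame (the name res and its contents are only for illustration):

```r
# Stand-in for the first rows of the result above; every column is a factor
# here, which is what older R versions produce from character input.
res <- data.frame(
  team.id = c("30", "23", "30"),
  points  = c("3", "1", "2"),
  quarter = c("1", "1", "1"),
  stringsAsFactors = TRUE
)

# Going through as.character() first works for both factor and character columns.
res[] <- lapply(res, function(col) as.integer(as.character(col)))

str(res)
```

Assigning into res[] keeps the data frame structure and column names while replacing the column contents.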
Data:
library(XML)
doc <- htmlParse('<quarter number="1">
<possession team-id="30" points="3"/>
<possession team-id="23" points="1"/>
<possession team-id="30" points="2"/>
</quarter>
<quarter number="2">
<possession team-id="23" points="3"/>
<possession team-id="30" points="2"/>
</quarter>
<quarter number="3">
<possession team-id="30" points="1"/>
<possession team-id="30" points="1"/>
<possession team-id="30" points="1"/>
<possession team-id="23" points="2"/>
</quarter>
<quarter number="4">
<possession team-id="23" points="2"/>
</quarter>')
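Since the question also mentions xml2, here is a rough sketch of the "repeat each quarter number by its child count" idea with that package. It assumes the quarter elements sit under a single root element (the game tag below is made up for the example, and the data is shortened), and that every child of a quarter is a possession.

```r
library(xml2)

# Hypothetical <game> root: read_xml() needs a single top-level element.
doc <- read_xml('<game>
  <quarter number="1">
    <possession team-id="30" points="3"/>
    <possession team-id="23" points="1"/>
  </quarter>
  <quarter number="2">
    <possession team-id="23" points="3"/>
  </quarter>
</game>')

quarters    <- xml_find_all(doc, "//quarter")
possessions <- xml_find_all(doc, "//quarter/possession")

my_new_df <- data.frame(
  team.id = as.integer(xml_attr(possessions, "team-id")),
  points  = as.integer(xml_attr(possessions, "points")),
  # xml_length() counts the element children of each quarter; both node sets
  # come back in document order, so the repeated numbers line up with the rows.
  quarter = rep(as.integer(xml_attr(quarters, "number")),
                times = xml_length(quarters))
)

my_new_df
```

If a quarter could contain other kinds of child elements, counting with an XPath restricted to possession would be safer than xml_length().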
Upvotes: 1
Reputation: 10441
Something like this seems to work, although it is not the greatest solution in my opinion. It uses the XML::xmlChildren function to count the child nodes under each quarter:
zed = possessions["//quarter"]
unlist(lapply(zed, FUN = function(x) length(XML::xmlChildren(x))))
3 2 4 1
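Those counts can then be fed to rep() to build the quarter column. A minimal sketch of the whole thing, assuming the quarters are wrapped in a single root element (the game tag is a placeholder, not from the real file) and using shortened data:

```r
library(XML)

# Hypothetical <game> root so that xmlParse() sees one top-level element.
xml_text <- '<game>
  <quarter number="1">
    <possession team-id="30" points="3"/>
    <possession team-id="23" points="1"/>
    <possession team-id="30" points="2"/>
  </quarter>
  <quarter number="2">
    <possession team-id="23" points="3"/>
  </quarter>
</game>'
doc <- xmlParse(xml_text, asText = TRUE)

quarters   <- getNodeSet(doc, "//quarter")
quarter_no <- as.integer(sapply(quarters, xmlGetAttr, "number"))

# Count only the possession children of each quarter, so any other
# child nodes would not throw the counts off.
n_poss <- sapply(quarters, function(q) length(getNodeSet(q, "./possession")))

my_new_df <- data.frame(
  team.id = as.integer(xpathSApply(doc, "//quarter/possession", xmlGetAttr, "team-id")),
  points  = as.integer(xpathSApply(doc, "//quarter/possession", xmlGetAttr, "points")),
  quarter = rep(quarter_no, times = n_poss)
)

my_new_df
```

This relies on the possession nodes being returned in document order, so the repeated quarter numbers match the rows one to one.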
Upvotes: 0