Reputation: 10163
I suspect this question has been asked before, but I couldn't find anything after searching. I am new to parsing XML documents. I am trying to parse an XML page that looks like this:
schedule = xmlParse("MYXML.XML")
# here's what schedule looks like
<all-games>
<game-schedule gameid="1">
<player name="Joe"/>
<player name="Mike"/>
<player name="Steve"/>
</game-schedule>
<game-schedule gameid="2">
<player name="Rick"/>
<player name="John"/>
<player name="Karl"/>
</game-schedule>
</all-games>
# here's my code to parse the XML
my_df = data.frame(
gameID = sapply(schedule["//game-schedule/@gameid"], as, "integer"),
player = sapply(schedule["//game-schedule/player/@name"], as, "character")
)
but it doesn't give me what I want. I would like to build the data frame so that the gameID is repeated once for each player in the game. That is, I am trying to get the following data frame:
my_df
gameID player
1 1 Joe
2 1 Mike
3 1 Steve
4 2 Rick
5 2 John
6 2 Karl
I'd hugely appreciate any help with this. It seems like something I should learn to do, since I will often need to parse XML files this way.
Upvotes: 0
Views: 62
Reputation: 2225
Here's another way, using fill from the tidyr package (filter and %>% come from dplyr):
XML:::xmlAttrsToDataFrame(schedule["//all-games//*"]) %>% fill(gameid) %>% filter(!is.na(name))
gameid name
1 1 Joe
2 1 Mike
3 1 Steve
4 2 Rick
5 2 John
6 2 Karl
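To see why the fill step is needed, here is a minimal sketch of what the pipeline does after the attributes have been collected. The `attrs` data frame below is a hand-built stand-in for the intermediate table (an assumption for illustration, not captured output): one row per node, with NA wherever a node lacks that attribute.

```r
library(tidyr)
library(dplyr)

# Hypothetical stand-in for the per-node attribute table:
# <game-schedule> rows carry only gameid, <player> rows carry only name.
attrs <- data.frame(
  gameid = c("1", NA, NA, NA, "2", NA, NA, NA),
  name   = c(NA, "Joe", "Mike", "Steve", NA, "Rick", "John", "Karl"),
  stringsAsFactors = FALSE
)

result <- attrs %>%
  fill(gameid) %>%        # carry each gameid down over the NA rows below it
  filter(!is.na(name))    # drop the <game-schedule> rows, which have no name

print(result)
```

Note that gameid stays a character column here; wrap it in as.integer() if you need numbers.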
Upvotes: 1
Reputation: 107567
Consider using node indexing in XPath, iterating over the number of <game-schedule> nodes:
library(XML)
xmlstr <- '<all-games>
<game-schedule gameid="1">
<player name="Joe"/>
<player name="Mike"/>
<player name="Steve"/>
</game-schedule>
<game-schedule gameid="2">
<player name="Rick"/>
<player name="John"/>
<player name="Karl"/>
</game-schedule>
</all-games>'
schedule <- xmlParse(xmlstr)
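# Count the <game-schedule> nodes
games <- length(xpathSApply(schedule, "//game-schedule"))
# Build one data frame per game; the single gameid is recycled across its players
dfList <- lapply(seq_len(games), function(i) {
  data.frame(
    gameID = sapply(schedule[paste0("//game-schedule[", i, "]/@gameid")], as, "integer"),
    player = sapply(schedule[paste0("//game-schedule[", i, "]/player/@name")], as, "character")
  )
})
# Stack the per-game data frames into one
my_df <- do.call(rbind, dfList)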
# gameID player
# 1 1 Joe
# 2 1 Mike
# 3 1 Steve
# 4 2 Rick
# 5 2 John
# 6 2 Karl
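A variation on the same idea, in case it is useful: instead of indexing by position, you can loop over the <game-schedule> nodes themselves and query each one with a relative XPath. This is only a sketch. The names game_nodes and df_list are mine, and it assumes every game has at least one player.

```r
library(XML)

xmlstr <- '<all-games>
  <game-schedule gameid="1">
    <player name="Joe"/>
    <player name="Mike"/>
    <player name="Steve"/>
  </game-schedule>
  <game-schedule gameid="2">
    <player name="Rick"/>
    <player name="John"/>
    <player name="Karl"/>
  </game-schedule>
</all-games>'
schedule <- xmlParse(xmlstr)

# Collect the <game-schedule> nodes once
game_nodes <- getNodeSet(schedule, "//game-schedule")

# One data frame per game node; data.frame() recycles the single
# gameid across that game's players (assumes at least one player)
df_list <- lapply(game_nodes, function(node) {
  data.frame(
    gameID = as.integer(xmlGetAttr(node, "gameid")),
    player = xpathSApply(node, "./player", xmlGetAttr, "name"),
    stringsAsFactors = FALSE
  )
})

my_df <- do.call(rbind, df_list)
print(my_df)
```

This avoids rebuilding an XPath string for every game, at the cost of one extra call to getNodeSet.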
Upvotes: 3