Canovice

Reputation: 10163

Parsing XML when parent nodes have different numbers of child nodes

I figure this question may have been asked before, but I couldn't find anything after searching. I am new to parsing XML documents, and I am trying to parse an XML file that looks like this:

schedule = xmlParse("MYXML.XML")

# here's what schedule looks like
<all-games>
  <game-schedule gameid="1">
    <player name="Joe"/>
    <player name="Mike"/>
    <player name="Steve"/>
  </game-schedule>
  <game-schedule gameid="2">
    <player name="Rick"/>
    <player name="John"/>
    <player name="Karl"/>
  </game-schedule>
</all-games>


# here's my code to parse the XML
my_df = data.frame(
  gameID = sapply(schedule["//game-schedule/@gameid"], as, "integer"),
  player = sapply(schedule["//game-schedule/player/@name"], as, "character")
)

But it doesn't give me what I want. I would like to build the data frame so that each gameID is repeated once for every player in that game. That is, I am trying to get the following data frame:

my_df
    gameID    player
1        1       Joe
2        1      Mike
3        1     Steve
4        2      Rick
5        2      John
6        2      Karl
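
To illustrate the mismatch, here is a minimal base R sketch. It uses hard-coded vectors as stand-ins for the two XPath results (the names `ids`, `players`, and `counts` are illustrative and not taken from the parsed document). With 2 IDs and 6 names, `data.frame()` recycles the shorter column, so the IDs alternate rather than repeat in blocks; repeating each ID by its game's player count gives the pairing I am after.

```r
# Hard-coded stand-ins for the two XPath results (illustrative values only)
ids     <- c(1L, 2L)
players <- c("Joe", "Mike", "Steve", "Rick", "John", "Karl")

# data.frame() recycles the shorter column because 6 is a multiple of 2,
# so the IDs come out as 1, 2, 1, 2, 1, 2 instead of 1, 1, 1, 2, 2, 2
recycled <- data.frame(gameID = ids, player = players)

# Repeating each ID by the number of players in its game fixes the pairing
counts <- c(3L, 3L)  # assumed player count per game
wanted <- data.frame(gameID = rep(ids, times = counts), player = players)
```

The open question is how to get the per-game counts (or the per-game player lists) out of the XML, since they can differ from game to game.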

I would hugely appreciate any help with this. It seems like something I should be able to do, because I will often need to parse XML files in this way.

Upvotes: 0

Views: 62

Answers (2)

Chris S.

Reputation: 2225

Here's another way, using fill() from the tidyr package (together with filter() from dplyr):

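library(XML)
library(dplyr)  # provides %>% and filter()
library(tidyr)  # provides fill()

# xmlAttrsToDataFrame is an unexported XML helper, hence the triple colon
XML:::xmlAttrsToDataFrame(schedule["//all-games//*"]) %>% fill(gameid) %>% filter(!is.na(name))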
  gameid  name
1      1   Joe
2      1  Mike
3      1 Steve
4      2  Rick
5      2  John
6      2  Karl

Upvotes: 1

Parfait

Reputation: 107567

Consider using node indexing in XPath, iterating over the number of <game-schedule> nodes and building one data frame per game:

library(XML)

xmlstr <- '<all-games>
              <game-schedule gameid="1">
                <player name="Joe"/>
                <player name="Mike"/>
                <player name="Steve"/>
              </game-schedule>
              <game-schedule gameid="2">
                <player name="Rick"/>
                <player name="John"/>
                <player name="Karl"/>  
              </game-schedule>
            </all-games>'

schedule <- xmlParse(xmlstr)

games <- length(xpathSApply(schedule, "//game-schedule"))

dfList <- lapply(seq_len(games), function(i){
  data.frame(
    gameID = sapply(schedule[paste0("//game-schedule[",i,"]/@gameid")], as, "integer"),
    player = sapply(schedule[paste0("//game-schedule[",i,"]/player/@name")], as, "character")
  )
})

my_df <- do.call(rbind, dfList)

#   gameID player
# 1      1    Joe
# 2      1   Mike
# 3      1  Steve
# 4      2   Rick
# 5      2   John
# 6      2   Karl
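
As a variation on the same idea, the loop can run over the node objects themselves instead of building an indexed XPath string for each game. The sketch below is only one possible way to write it: it uses a shortened sample document with an uneven number of players per game, and it assumes every game has at least one player.

```r
library(XML)

# Shortened sample with a different number of players in each game
xmlstr <- '<all-games>
  <game-schedule gameid="1">
    <player name="Joe"/>
    <player name="Mike"/>
  </game-schedule>
  <game-schedule gameid="2">
    <player name="Rick"/>
  </game-schedule>
</all-games>'

schedule <- xmlParse(xmlstr, asText = TRUE)

# One node object per <game-schedule> element
games <- getNodeSet(schedule, "//game-schedule")

# For each game node, read its own gameid and query only its own players;
# data.frame() then recycles the single gameID across that game's players
dfList <- lapply(games, function(g) {
  data.frame(
    gameID = as.integer(xmlGetAttr(g, "gameid")),
    player = xpathSApply(g, "./player", xmlGetAttr, "name"),
    stringsAsFactors = FALSE
  )
})

my_df <- do.call(rbind, dfList)
```

Because each relative query starts from a single game node, the gameID and its players stay paired no matter how many players a game has.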

Upvotes: 3
