timnus
timnus

Reputation: 197

How to extract XML attributes when unnesting XML that has been converted to a dataframe in R?

I have an XML file that looks like this:

xml_reprex <- '<Athletes>
  <Athlete>
    <AthleteIdentifier Id="157" />
      <TotalGames>354</TotalGames>
        <GamesByType>
          <Games Type="Regular">301</Games>
          <Games Type="Finals">53</Games>
        </GamesByType>
  </Athlete>
  <Athlete>
    <AthleteIdentifier Id="223" />
      <TotalGames>154</TotalGames>
        <GamesByType>
          <Games Type="Regular">142</Games>
          <Games Type="Finals">12</Games>
        </GamesByType>
  </Athlete>
</Athletes>'

I would like to turn this into a tidy data frame with columns AthleteIdentifier, TotalGames, GameType and Games.

Here is my current attempt:

df <- read_xml(xml_reprex) 

as_tibble(as_list(df)) %>% 
  unnest_wider(Athletes, names_repair = "universal") %>% 
  unnest(TotalGames) %>% 
  unnest(TotalGames) %>% 
  unnest(GamesByType) %>% 
  unnest(GamesByType) %>% 
  unnest(GamesByType)

However, I lose the AthleteIdentifier values as they are listed as an attribute, similarly I lose the Games type as it is an attribute as well.

Is this because I have converted it to a tibble and therefore I lose all those attributes? Is there a way I can retain that information in the conversion to a tibble for for manipulation?

Thanks.

Upvotes: 0

Views: 136

Answers (1)

Chris Ruehlemann
Chris Ruehlemann

Reputation: 21400

Here's one approach. It relies mostly on tidyr's function extract, which uses regular expression to describe the string in toto and separate out strings to be extracted with capturing groups in parentheses:

library(dplyr)
library(tidyr)
data.frame(xml_reprex) %>%
  mutate(xml_reprex = gsub("</?Athletes>|\n\\s+", "", xml_reprex)) %>%
  separate_rows(xml_reprex, sep = "(?<!^)<Athlete>") %>%
  extract(xml_reprex,
          into = c("AthleteIdentifier", "TotalGames", "GameType_Regular", "GameType_Finals"),
          regex = 'AthleteIdentifier Id="(\\d+)".*TotalGames>(\\d+)<.*Games Type="Regular">(\\d+).*Games Type="Finals">(\\d+)')
# A tibble: 2 × 4
  AthleteIdentifier TotalGames GameType_Regular GameType_Finals
  <chr>             <chr>      <chr>            <chr>          
1 157               354        301              53             
2 223               154        142              12   

Upvotes: 1

Related Questions