Reputation: 197
I have an XML file that looks like this:
xml_reprex <- '<Athletes>
<Athlete>
<AthleteIdentifier Id="157" />
<TotalGames>354</TotalGames>
<GamesByType>
<Games Type="Regular">301</Games>
<Games Type="Finals">53</Games>
</GamesByType>
</Athlete>
<Athlete>
<AthleteIdentifier Id="223" />
<TotalGames>154</TotalGames>
<GamesByType>
<Games Type="Regular">142</Games>
<Games Type="Finals">12</Games>
</GamesByType>
</Athlete>
</Athletes>'
I would like to turn this into a tidy data frame with columns AthleteIdentifier
, TotalGames
, GameType
and Games
.
Here is my current attempt:
df <- read_xml(xml_reprex)
as_tibble(as_list(df)) %>%
unnest_wider(Athletes, names_repair = "universal") %>%
unnest(TotalGames) %>%
unnest(TotalGames) %>%
unnest(GamesByType) %>%
unnest(GamesByType) %>%
unnest(GamesByType)
However, I lose the AthleteIdentifier
values as they are listed as an attribute, similarly I lose the Games
type as it is an attribute as well.
Is this because I have converted it to a tibble and therefore I lose all those attributes? Is there a way I can retain that information in the conversion to a tibble for for manipulation?
Thanks.
Upvotes: 0
Views: 136
Reputation: 21400
Here's one approach. It relies mostly on tidyr
's function extract
, which uses regular expression to describe the string in toto and separate out strings to be extracted with capturing groups in parentheses:
library(dplyr)
library(tidyr)
data.frame(xml_reprex) %>%
mutate(xml_reprex = gsub("</?Athletes>|\n\\s+", "", xml_reprex)) %>%
separate_rows(xml_reprex, sep = "(?<!^)<Athlete>") %>%
extract(xml_reprex,
into = c("AthleteIdentifier", "TotalGames", "GameType_Regular", "GameType_Finals"),
regex = 'AthleteIdentifier Id="(\\d+)".*TotalGames>(\\d+)<.*Games Type="Regular">(\\d+).*Games Type="Finals">(\\d+)')
# A tibble: 2 × 4
AthleteIdentifier TotalGames GameType_Regular GameType_Finals
<chr> <chr> <chr> <chr>
1 157 354 301 53
2 223 154 142 12
Upvotes: 1