David Fombella Pombal
David Fombella Pombal

Reputation: 33

Parse xml to dataframe including children and attributes in R

I am trying to create a dataframe from attached xml https://1drv.ms/u/s!Am7buNMZi-gwgeBmbk6A-NRIRarjYw?e=Pcgm7c

I need to get for all players their column info and info about Team (Parent)

XML sample

<SoccerFeed timestamp="20190519T183022+0000">
  <SoccerDocument Type="SQUADS Latest" competition_code="ES_PL" competition_id="23" competition_name="Spanish La Liga" season_id="2018" season_name="Season 2018/2019">
    <Team country="Spain" country_id="4" country_iso="ES" official_club_name="Deportivo Alavés S.A.D." region_id="17" region_name="Europe" short_club_name="Alavés" uID="t173">
      <Founded>1921</Founded>
      <Name>Alavés</Name>
      <Player uID="p91406">
        <Name>Fernando Pacheco</Name>
        <Position>Goalkeeper</Position>
        <Stat Type="first_name">Fernando</Stat>
        <Stat Type="last_name">Pacheco</Stat>
        <Stat Type="birth_date">1992-05-18</Stat>
        <Stat Type="birth_place">Badajoz</Stat>
        <Stat Type="first_nationality">Spain</Stat>
        <Stat Type="preferred_foot">Left</Stat>
        <Stat Type="weight">81</Stat>
        <Stat Type="height">186</Stat>
        <Stat Type="jersey_num">1</Stat>
        <Stat Type="real_position">Goalkeeper</Stat>
        <Stat Type="real_position_side">Unknown</Stat>
        <Stat Type="join_date">2015-08-07</Stat>
        <Stat Type="country">Spain</Stat>
      </Player>
      <Player uID="p176245">
        <Name>Antonio Sivera</Name>
        <Position>Goalkeeper</Position>
        <Stat Type="first_name">Antonio</Stat>
        <Stat Type="last_name">Sivera</Stat>
        <Stat Type="birth_date">1996-08-11</Stat>
        <Stat Type="birth_place">Jávea</Stat>
        <Stat Type="first_nationality">Spain</Stat>
        <Stat Type="preferred_foot">Right</Stat>
        <Stat Type="weight">75</Stat>
        <Stat Type="height">184</Stat>
        <Stat Type="jersey_num">13</Stat>
        <Stat Type="real_position">Goalkeeper</Stat>
        <Stat Type="real_position_side">Unknown</Stat>
        <Stat Type="join_date">2017-07-19</Stat>
        <Stat Type="country">Spain</Stat>
      </Player>
     </Team>
     <Team city="Madrid" country="Spain" country_id="4" country_iso="ES" official_club_name="Club Atlético de Madrid S.A.D" postal_code="28005" region_id="17" region_name="Europe" short_club_name="Atlético" street="Paseo Virgen del Puerto, 67" uID="t175" web_address="www.clubatleticodemadrid.com/">
      <Founded>1903</Founded>
      <Name>Atlético de Madrid</Name>
      <Player uID="p59981">
        <Name>Antonio Adán</Name>
        <Position>Goalkeeper</Position>
        <Stat Type="first_name">Antonio</Stat>
        <Stat Type="last_name">Adán</Stat>
        <Stat Type="birth_date">1987-05-13</Stat>
        <Stat Type="birth_place">Madrid</Stat>
        <Stat Type="first_nationality">Spain</Stat>
        <Stat Type="preferred_foot">Left</Stat>
        <Stat Type="weight">92</Stat>
        <Stat Type="height">190</Stat>
        <Stat Type="jersey_num">1</Stat>
        <Stat Type="real_position">Goalkeeper</Stat>
        <Stat Type="real_position_side">Unknown</Stat>
        <Stat Type="join_date">2018-07-10</Stat>
        <Stat Type="country">Spain</Stat>
      </Player>
      <Player uID="p81352">
        <Name>Jan Oblak</Name>
        <Position>Goalkeeper</Position>
        <Stat Type="first_name">Jan</Stat>
        <Stat Type="last_name">Oblak</Stat>
        <Stat Type="birth_date">1993-01-07</Stat>
        <Stat Type="birth_place">Skojfa Loka</Stat>
        <Stat Type="first_nationality">Slovenia</Stat>
        <Stat Type="preferred_foot">Right</Stat>
        <Stat Type="weight">87</Stat>
        <Stat Type="height">188</Stat>
        <Stat Type="jersey_num">13</Stat>
        <Stat Type="real_position">Goalkeeper</Stat>
        <Stat Type="real_position_side">Unknown</Stat>
        <Stat Type="join_date">2014-07-16</Stat>
        <Stat Type="country">Slovenia</Stat>
      </Player>
     </Team>
   </SoccerDocument>
</SoccerFeed>



My desired columns

TEAM COLUMNS

PLAYER columns

I am not interested in Teams and Player nodes under Player changes below section /SoccerFeed/SoccerDocument/PlayerChanges

I started with tidyverse and xml2 to gather player info in combination with tidyverse but i was unable to get team parent info and different stats for players


library(xml2)
library(tidyverse)
library(plyr)



x <- read_xml("squads.xml")

players <- x %>% 
  xml_find_all('/SoccerFeed/SoccerDocument/Team/Player') %>% 
  map_df(~flatten(c(xml_attrs(.x), 
                    map(xml_children(.x), 
                        ~set_names(as.list(xml_text(.x)), xml_name(.x)))))) %>%
  type_convert()

Upvotes: 1

Views: 199

Answers (1)

Parfait
Parfait

Reputation: 107642

Since you use xml2 and require various data nodes that differ across nested levels, consider XSLT, the special-purpose language (like SQL) designed to transform XML files. In R, the xslt package, sister module to xml2, can run XSLT 1.0 scripts. The recursive, template nature of XSLT helps avoid complex nested loops or mapping at application layer, here being R. Plus XSLT is portable (like SQL) and can be run outside of R.

While this may be a whole new concept out of left field requiring a learning curve, it cleanly flattens your XML to the 2-D structure needed for data sets. You also separate XML handling (XSLT) from data handling (R). Specifically, only Player level is retained with respective Team data migrated down (see demo).

XSLT (save as .xsl, a special .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/SoccerFeed|SoccerDocument">
      <xsl:apply-templates select="*"/>
  </xsl:template>

  <xsl:template match="Team">
      <xsl:apply-templates select="Player"/>
  </xsl:template>

  <xsl:template match="Team/@*">
    <xsl:element name="{concat('team_', name(.))}">
      <xsl:value-of select="."/>      
    </xsl:element>
  </xsl:template>

  <xsl:template match="Player">
    <xsl:copy>
      <xsl:apply-templates select="ancestor::Team/@*"/>
      <xsl:copy-of select="Name|Position"/>
      <xsl:apply-templates select="@*|Stat"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="Player/@*">
    <xsl:element name="{name(.)}">
      <xsl:value-of select="."/>      
    </xsl:element>
  </xsl:template>

  <xsl:template match="Stat">
    <xsl:element name="{@Type}">
      <xsl:value-of select="text()"/>     
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>

Online Demo

R (results in data frame of all character types)

library(xml2)
library(xslt)
library(dplyr)

# INPUT SOURCE
doc <- read_xml("/path/to/Input.xml")
style <- read_xml("/path/to/Style.xsl", package = "xslt")

# TRANSFORM 
new_xml <- xml_xslt(doc, style)

# RETRIEVE Player NODES
recs <- xml_find_all(new_xml, "//Player")

# BIND EACH CHILD TEXT AND NAME TO Player DFs
df_list <- lapply(recs, function(r) 
    data.frame(rbind(setNames(xml_text(xml_children(r)), 
                              xml_name(xml_children(r)))),
               stringsAsFactors = FALSE)
)

# BIND ALL DFs TO SINGLE MASTER DF
final_df <- bind_rows(df_list)

Upvotes: 1

Related Questions