convert xml data into pandas dataframe in python

Question

I want to turn the following data into a dataframe, de xml file looks like this:




  
    sort order
  
  
    
      1
      11.24
      31.24
      19. Luis Alehandru Paz Mulato
      
        Team
        America de Cali
      
      
        Action
        Passes accurate
      
      
        Half
        1st half
      
      52.6
      34.2
    
    
      2
      11.24
      31.24
      19. Luis Alehandru Paz Mulato
      
        Team
        America de Cali
      
      
        Action
        Positional attacks
      
      
        Half
        1st half
      
      52.6
      34.2
    
    
      3
      14.43
      34.43
      6. Felipe Jaramillo Velasquez
      
        Team
        America de Cali
      
      
        Action
        Positional attacks
      
      
        Half
        1st half
      
      38.6
      26.2

After there is a little bit more, but I only need the data inside the 'instance' tags as if those were the rows. So my columns would be as follows:

['ID', 'start', 'end', 'code', 'group1', 'text1', 'group2', 'text2', 'group3', 'text3', 'pos_x','pos_y']

and the first row would be:

['1', '11.24', '31.24', '19. Luis Alehandru Paz Mulato', 'Team', 'America de Cali', 'Action', 'Passes accurate', 'Half', '1st half', '52.6','34.2']

It is my firts time working with xml files, thank you so much for your help.

I already tried the following:

import xml.etree.ElementTree as et 

xtree = et.parse("players_vs_nacional_0-3.xml")
xroot = xtree.getroot()
    

for node in xroot:         
    id_action = node.find("ID").text   
    start = node.find("start").text 
    end = node.find("end").text 
    code = node.find("code").text 
    posx = node.find("pos_x").text 
    posy = node.find("pos_y").text

But I got the error:

AttributeError: 'NoneType' object has no attribute 'text'

I believe it is because the first nodes don't contain what I was looking for ("ID", "start", "end", etc), but I couldn't get passed that and I am also having trouble trying to figure out how to get the info out of the tags because there are three of them for every and I need them all and in the same row.

Jack Fleeting · Accepted Answer

I believe you may be looking for something like this:

from lxml import etree
import pandas as pd
inst = """[your xml above, corrected; it's missing a closing  tag]"""
doc = etree.XML(inst.encode())

rows =[]
targets = doc.xpath('//instance')
columns = [elem.tag for elem in targets[0].xpath('.//*') if len(elem.getchildren())==0 ]

for target in targets:
    row = []
    for item in target.xpath('.//*'):            
        if len(item.getchildren())>0:
            continue
        else:
            row.append(item.text.strip())
    rows.append(row)
df = pd.DataFrame(rows,columns=columns)
df

Output:

ID  start   end     code    group   text    group   text    group   text    pos_x   pos_y
0   1   11.24   31.24   19. Luis Alehandru Paz Mulato   Team    America de Cali     Action  Passes accurate     Half    1st half    52.6    34.2
1   2   11.24   31.24   19. Luis Alehandru Paz Mulato   Team    America de Cali     Action  Positional attacks  Half    1st half    52.6    34.2
2   3   14.43   34.43   6. Felipe Jaramillo Velasquez   Team    America de Cali     Action  Positional attacks  Half    1st half    38.6    26.2

convert xml data into pandas dataframe in python

Answers (2)

Related Questions