Parsing through a deep-nested XML File in Python

Question

I am looking at an xml file similar to the below:


  1418929691920
  28962804
  162995589
  
    
      2014-12-19 11:15
      422739932
      Alpine Skiing
      DH 145
      No
      
        
          Kjetil Jansrud (NOR)
          2001
          2001
          Visiting
        
        
          The Field
          2002
          2002
          Home
        
      
      
        
          0
          Matchups
          2014-12-19 11:15
          I
          open
          200
          100
          200
          
            116
            -136
          
        
      
      1418929691920

I have parsed the file with the code below:

pinny_url = 'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball'

tree = ET.parse(urllib.urlopen(pinny_url))
root = tree.getroot()
list = []

for event in root.iter('event'):
    event_datetimeGMT = event.find('event_datetimeGMT').text
    gamenumber = event.find('gamenumber').text
    sporttype = event.find('sporttype').text
    league = event.find('league').text
    IsLive = event.find('IsLive').text
    for participants in event.iter('participants'):
        for participant in participants.iter('participant'):
            p1_name = participant.find('participant_name').text
            contestantnum  = participant.find('contestantnum').text
            rotnum = participant.find('rotnum').text
            vhd = participant.find('visiting_home_draw').text
    for periods in event.iter('periods'):
        for period in periods.iter('period'):
            period_number = period.find('period_number').text
            desc = period.find('period_description').text
            pdatetime = period.find('periodcutoff_datetimeGMT')
            status = period.find('period_status').text
            update = period.find('period_update').text
            max = period.find('spread_maximum').text
            mlmax = period.find('moneyline_maximum').text
            tot_max = period.find('total_maximum').text
            for moneyline in period.iter('moneyline'):
                ml_vis = moneyline.find('moneyline_visiting').text
                ml_home = moneyline.find('moneyline_home').text

However, I am hoping to get the nodes separated by event similar to a 2D table (as in a pandas dataframe). However, the full xml file has multiple "event" children, some events that do not share the same nodes as above. I am struggling quite mightily with being able to take each event node and simply create a 2d table with the tag and that value where the tag acts as the column name and the text acts as the value.

Up to this point, I have done the above to gauge how I might put that information into a dictionary and subsequently put a number of dictionaries into a list from which I can create a dataframe using pandas, but that has not worked out, as all attempts have required me to find and replace text to create the dxcictionaries and python has not responded well to that when attempting to subsequently create a dataframe. I have also used a simple:

for elt in tree.iter():
  list.append("'%s': '%s'") % (elt.tag, elt.text.strip()))

which worked quite well in simple pulling out every single tag and the corresponding text, but I was unable to make anything of that because any attempts at finding and replacing the text to create dictionaries was no good.

Any assistance would be greatly appreciated.

Thank you.

VooDooNOFX · Accepted Answer

Here's an easy way to get your XML into a pandas dataframe. This utilizes the awesome requests library (which you can switch for urllib if you'd like, as well as the always helpful xmltodict library available in pypi. (NOTE: a reverse library is also available, knows as dicttoxml)

import json
import pandas
import requests
import xmltodict

web_request = requests.get(u'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball')

# Make that unweidly XML doc look like a native Dictionary!
result = xmltodict.parse(web_request.text)

# Next, convert the nested OrderedDict to a real dict, which isn't strictly necessary, but helps you
#   visualize what the structure of the data looks like
normal_dict = json.loads(json.dumps(result.get('pinnacle_line_feed', {}).get(u'events', {}).get(u'event', [])))

# Now, make that dictionary into a dataframe
df = pandas.DataFrame.from_dict(normal_dict)

To get some idea of what this is starting to look like, here's the first couple of lines of the CSV:

>>> from StringIO import StringIO
>>> foo = StringIO()  # A fake file to write to
>>> df.to_csv(foo)  # Output the df to a CSV file
>>> foo.seek(0)  # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,participants,periods,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,"{u'participant': [{u'contestantnum': u'1071', u'rotnum': u'1071', u'visiting_home_draw': u'Home', u'participant_name': u'Obras Sanitarias'}, {u'contestantnum': u'1072', u'rotnum': u'1072', u'visiting_home_draw': u'Visiting', u'participant_name': u'Libertad'}]}",,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,"{u'participant': [{u'contestantnum': u'1079', u'rotnum': u'1079', u'visiting_home_draw': u'Home', u'participant_name': u'Boca Juniors'}, {u'contestantnum': u'1080', u'rotnum': u'1080', u'visiting_home_draw': u'Visiting', u'participant_name': u'Penarol'}]}","{u'period': {u'total_maximum': u'450', u'total': {u'total_points': u'152.5', u'under_adjust': u'-107', u'over_adjust': u'-103'}, u'spread_maximum': u'450', u'period_description': u'Game', u'moneyline_maximum': u'450', u'period_number': u'0', u'period_status': u'I', u'spread': {u'spread_visiting': u'3', u'spread_adjust_visiting': u'-102', u'spread_home': u'-3', u'spread_adjust_home': u'-108'}, u'periodcutoff_datetimeGMT': u'2015-01-06 23:00', u'moneyline': {u'moneyline_visiting': u'136', u'moneyline_home': u'-150'}, u'period_update': u'open'}}",Basketball

Notice that the participants and periods columns are still their native Python dictionaries. You'll either need to remove them from the columns list, or do some additional mangling to get them to flatten out:

# Remove the offending columns in this example by selecting particular columns to show
>>> from StringIO import StringIO
>>> foo = StringIO()  # A fake file to write to
>>> df.to_csv(foo, cols=['IsLive', 'event_datetimeGMT', 'gamenumber', 'league', 'sporttype'])
>>> foo.seek(0)  # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,Basketball

Parsing through a deep-nested XML File in Python

Answers (1)

Related Questions