amachure
amachure

Reputation: 43

Parsing through a deep-nested XML File in Python

I am looking at an xml file similar to the below:

<pinnacle_line_feed>
  <PinnacleFeedTime>1418929691920</PinnacleFeedTime>
  <lastContest>28962804</lastContest>
  <lastGame>162995589</lastGame>
  <events>
    <event>
      <event_datetimeGMT>2014-12-19 11:15</event_datetimeGMT>
      <gamenumber>422739932</gamenumber>
      <sporttype>Alpine Skiing</sporttype>
      <league>DH 145</league>
      <IsLive>No</IsLive>
      <participants>
        <participant>
          <participant_name>Kjetil Jansrud (NOR)</participant_name>
          <contestantnum>2001</contestantnum>
          <rotnum>2001</rotnum>
          <visiting_home_draw>Visiting</visiting_home_draw>
        </participant>
        <participant>
          <participant_name>The Field</participant_name>
          <contestantnum>2002</contestantnum>
          <rotnum>2002</rotnum>
          <visiting_home_draw>Home</visiting_home_draw>
        </participant>
      </participants>
      <periods>
        <period>
          <period_number>0</period_number>
          <period_description>Matchups</period_description>
          <periodcutoff_datetimeGMT>2014-12-19 11:15</periodcutoff_datetimeGMT>
          <period_status>I</period_status>
          <period_update>open</period_update>
          <spread_maximum>200</spread_maximum>
          <moneyline_maximum>100</moneyline_maximum>
          <total_maximum>200</total_maximum>
          <moneyline>
            <moneyline_visiting>116</moneyline_visiting>
            <moneyline_home>-136</moneyline_home>
          </moneyline>
        </period>
      </periods>
      <PinnacleFeedTime>1418929691920</PinnacleFeedTime>
    </event>
  </events>
</pinnacle_line_feed>

I have parsed the file with the code below:

pinny_url = 'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball'

tree = ET.parse(urllib.urlopen(pinny_url))
root = tree.getroot()
list = []

for event in root.iter('event'):
    event_datetimeGMT = event.find('event_datetimeGMT').text
    gamenumber = event.find('gamenumber').text
    sporttype = event.find('sporttype').text
    league = event.find('league').text
    IsLive = event.find('IsLive').text
    for participants in event.iter('participants'):
        for participant in participants.iter('participant'):
            p1_name = participant.find('participant_name').text
            contestantnum  = participant.find('contestantnum').text
            rotnum = participant.find('rotnum').text
            vhd = participant.find('visiting_home_draw').text
    for periods in event.iter('periods'):
        for period in periods.iter('period'):
            period_number = period.find('period_number').text
            desc = period.find('period_description').text
            pdatetime = period.find('periodcutoff_datetimeGMT')
            status = period.find('period_status').text
            update = period.find('period_update').text
            max = period.find('spread_maximum').text
            mlmax = period.find('moneyline_maximum').text
            tot_max = period.find('total_maximum').text
            for moneyline in period.iter('moneyline'):
                ml_vis = moneyline.find('moneyline_visiting').text
                ml_home = moneyline.find('moneyline_home').text

However, I am hoping to get the nodes separated by event similar to a 2D table (as in a pandas dataframe). However, the full xml file has multiple "event" children, some events that do not share the same nodes as above. I am struggling quite mightily with being able to take each event node and simply create a 2d table with the tag and that value where the tag acts as the column name and the text acts as the value.

Up to this point, I have done the above to gauge how I might put that information into a dictionary and subsequently put a number of dictionaries into a list from which I can create a dataframe using pandas, but that has not worked out, as all attempts have required me to find and replace text to create the dxcictionaries and python has not responded well to that when attempting to subsequently create a dataframe. I have also used a simple:

for elt in tree.iter():
  list.append("'%s': '%s'") % (elt.tag, elt.text.strip()))

which worked quite well in simple pulling out every single tag and the corresponding text, but I was unable to make anything of that because any attempts at finding and replacing the text to create dictionaries was no good.

Any assistance would be greatly appreciated.

Thank you.

Upvotes: 1

Views: 2801

Answers (1)

VooDooNOFX
VooDooNOFX

Reputation: 4762

Here's an easy way to get your XML into a pandas dataframe. This utilizes the awesome requests library (which you can switch for urllib if you'd like, as well as the always helpful xmltodict library available in pypi. (NOTE: a reverse library is also available, knows as dicttoxml)

import json
import pandas
import requests
import xmltodict

web_request = requests.get(u'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball')

# Make that unweidly XML doc look like a native Dictionary!
result = xmltodict.parse(web_request.text)

# Next, convert the nested OrderedDict to a real dict, which isn't strictly necessary, but helps you
#   visualize what the structure of the data looks like
normal_dict = json.loads(json.dumps(result.get('pinnacle_line_feed', {}).get(u'events', {}).get(u'event', [])))

# Now, make that dictionary into a dataframe
df = pandas.DataFrame.from_dict(normal_dict)

To get some idea of what this is starting to look like, here's the first couple of lines of the CSV:

>>> from StringIO import StringIO
>>> foo = StringIO()  # A fake file to write to
>>> df.to_csv(foo)  # Output the df to a CSV file
>>> foo.seek(0)  # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,participants,periods,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,"{u'participant': [{u'contestantnum': u'1071', u'rotnum': u'1071', u'visiting_home_draw': u'Home', u'participant_name': u'Obras Sanitarias'}, {u'contestantnum': u'1072', u'rotnum': u'1072', u'visiting_home_draw': u'Visiting', u'participant_name': u'Libertad'}]}",,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,"{u'participant': [{u'contestantnum': u'1079', u'rotnum': u'1079', u'visiting_home_draw': u'Home', u'participant_name': u'Boca Juniors'}, {u'contestantnum': u'1080', u'rotnum': u'1080', u'visiting_home_draw': u'Visiting', u'participant_name': u'Penarol'}]}","{u'period': {u'total_maximum': u'450', u'total': {u'total_points': u'152.5', u'under_adjust': u'-107', u'over_adjust': u'-103'}, u'spread_maximum': u'450', u'period_description': u'Game', u'moneyline_maximum': u'450', u'period_number': u'0', u'period_status': u'I', u'spread': {u'spread_visiting': u'3', u'spread_adjust_visiting': u'-102', u'spread_home': u'-3', u'spread_adjust_home': u'-108'}, u'periodcutoff_datetimeGMT': u'2015-01-06 23:00', u'moneyline': {u'moneyline_visiting': u'136', u'moneyline_home': u'-150'}, u'period_update': u'open'}}",Basketball

Notice that the participants and periods columns are still their native Python dictionaries. You'll either need to remove them from the columns list, or do some additional mangling to get them to flatten out:

# Remove the offending columns in this example by selecting particular columns to show
>>> from StringIO import StringIO
>>> foo = StringIO()  # A fake file to write to
>>> df.to_csv(foo, cols=['IsLive', 'event_datetimeGMT', 'gamenumber', 'league', 'sporttype'])
>>> foo.seek(0)  # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,Basketball

Upvotes: 2

Related Questions