Parsing metadata properties from ADO.Net Data Services XML in Python

Question

I want to load some XML into a pandas DataFrame before I insert it into a database table. I've looked at ElementTree and lxml, but the examples are very simple and I can't extrapolate them to something this complex. I understand XML; I'm just not sure how to drill down to what I need. A sample is below.

I'm after the values inside <m:properties>. So NEW_DATE = 1997-01-02T00:00:00, BC_1YEAR = 5.630000114440918, etc. are what go into the database. Notice that BC_1MONTH is NULL and is not like the other nodes.



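<feed xmlns="http://www.w3.org/2005/Atom" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <title>DailyTreasuryYieldCurveRateData</title>
  <id>http://data.treasury.gov/feed.svc/DailyTreasuryYieldCurveRateData</id>
  <updated>2017-10-30T20:31:53Z</updated>
  <link />
  <entry>
    <id>http://data.treasury.gov/Feed.svc/DailyTreasuryYieldCurveRateData(1)</id>
    <title />
    <updated>2017-10-30T20:31:53Z</updated>
    <author>
      <name />
    </author>
    <link />
    <category />
    <content>
      <m:properties>
        <d:Id m:type="Edm.Int32">1</d:Id>
        <d:NEW_DATE m:type="Edm.DateTime">1997-01-02T00:00:00</d:NEW_DATE>
        <d:BC_1MONTH m:null="true" />
        <d:BC_3MONTH m:type="Edm.Double">5.190000057220459</d:BC_3MONTH>
        <d:BC_6MONTH m:type="Edm.Double">5.3499999046325684</d:BC_6MONTH>
        <d:BC_1YEAR m:type="Edm.Double">5.630000114440918</d:BC_1YEAR>
        <d:BC_2YEAR m:type="Edm.Double">5.96999979019165</d:BC_2YEAR>
        <d:BC_3YEAR m:type="Edm.Double">6.130000114440918</d:BC_3YEAR>
        <d:BC_5YEAR m:type="Edm.Double">6.3000001907348633</d:BC_5YEAR>
        <d:BC_7YEAR m:type="Edm.Double">6.4499998092651367</d:BC_7YEAR>
        <d:BC_10YEAR m:type="Edm.Double">6.5399999618530273</d:BC_10YEAR>
        <d:BC_20YEAR m:type="Edm.Double">6.8499999046325684</d:BC_20YEAR>
        <d:BC_30YEAR m:type="Edm.Double">6.75</d:BC_30YEAR>
        <d:BC_30YEARDISPLAY m:type="Edm.Double">0</d:BC_30YEARDISPLAY>
      </m:properties>
    </content>
  </entry>
</feed>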

If you have links to a good article that talks about this, that would be appreciated too.

Below is the code I'm working with:

import xml.etree.ElementTree as ET
import pandas as pd

xml_data = open('/path/user_agents.xml').read()

def xml2df(xml_data):
    root = ET.XML(xml_data) # element tree
    all_records = []
    for child in root:
        record = {}
        for subchild in child:
            record[subchild.tag] = subchild.text
        all_records.append(record)  # append once per record, not once per field
    return pd.DataFrame(all_records)

Error message received from Duffy's code:

Traceback (most recent call last):
  File "C:/Users/Bob/Desktop/temp/yield curve script.py", line 25, in <module>
    xml2dict(xml_data)
  File "C:/Users/Bob/Desktop/temp/yield curve script.py", line 13, in xml2dict
    root = lxml.etree.parse(xml_file)
  File "src\lxml\lxml.etree.pyx", line 3427, in lxml.etree.parse (src\lxml\lxml.etree.c:81100)
  File "src\lxml\parser.pxi", line 1811, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:117831)
  File "src\lxml\parser.pxi", line 1837, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:118178)
  File "src\lxml\parser.pxi", line 1741, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:117090)
  File "src\lxml\parser.pxi", line 1138, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:111636)
  File "src\lxml\parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:105092)
  File "src\lxml\parser.pxi", line 706, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:106800)
  File "src\lxml\parser.pxi", line 633, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:105611)
OSError: Error reading file '
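
The traceback shows xml2dict being called with xml_data, and lxml going through _parseDocumentFromURL, which suggests that the XML text itself was passed where parse() expects a file name or a file object. Below is a minimal sketch of the difference. It uses the standard library's ElementTree rather than lxml so that it runs without extra packages; xml_text and parse_as_path are hypothetical names, and lxml.etree provides parse and fromstring functions that play the same roles.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the contents of the feed file.
xml_text = "<feed><entry>1</entry></feed>"

def parse_as_path(text):
    """Pass markup where a path is expected and report how parse() reacts."""
    try:
        ET.parse(text)  # a str argument is treated as a file name
    except OSError:
        return "OSError"
    return "parsed"

# Either of these handles markup that is already in memory.
root_from_string = ET.fromstring(xml_text)
root_from_file_object = ET.parse(io.BytesIO(xml_text.encode("utf-8"))).getroot()
```

If that is indeed the cause, passing the path itself (or an open file object) to xml2dict, instead of the result of read(), should avoid the error.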

Charles Duffy · Accepted Answer

import lxml.etree
import datetime

nsmap = {
  'm': 'http://schemas.microsoft.com/ado/2007/08/dataservices/metadata',
  'd': 'http://schemas.microsoft.com/ado/2007/08/dataservices'
}
m_null = ('{%s}null' % nsmap['m'])
m_type = ('{%s}type' % nsmap['m'])

type_handlers = {
    'Edm.Double': float,
    'Edm.Int32': int,
    # str.translate(None, ...) only works on Python 2; parse the ISO-style timestamp directly
    'Edm.DateTime': lambda s: datetime.datetime.strptime(s, "%Y-%m-%dT%H:%M:%S"),
}

def xml2dict(xml_file):
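    root = lxml.etree.parse(xml_file)  # xml_file: a path or file object, not a string of XML
    result = {}  # a single dict; with several entries, later values overwrite earlier ones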
    for properties_el in root.xpath('//m:properties', namespaces=nsmap):
        for child in properties_el:  # iterate directly; getchildren() is deprecated
            tag = child.tag.split('}',1)[-1]  # split the namespace off the tag
            if child.attrib.get(m_null) == 'true':
                value = None
            else:
                value = child.text
                type_handler = type_handlers.get(child.attrib.get(m_type))
                if type_handler is not None:
                    value = type_handler(value)
            result[tag] = value
    return result

...properly returns, for your data:

{'BC_10YEAR': 6.539999961853027,
 'BC_1MONTH': None,
 'BC_1YEAR': 5.630000114440918,
 'BC_20YEAR': 6.849999904632568,
 'BC_2YEAR': 5.96999979019165,
 'BC_30YEAR': 6.75,
 'BC_30YEARDISPLAY': 0.0,
 'BC_3MONTH': 5.190000057220459,
 'BC_3YEAR': 6.130000114440918,
 'BC_5YEAR': 6.300000190734863,
 'BC_6MONTH': 5.349999904632568,
 'BC_7YEAR': 6.449999809265137,
 'Id': 1,
 'NEW_DATE': datetime.datetime(1997, 1, 2, 0, 0)}
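
Since the goal is a DataFrame with one row per entry, the same idea can be extended to collect one dict per m:properties element instead of a single dict. The following is only a sketch under a few assumptions: it uses the standard library's ElementTree in place of lxml, it takes the XML as a string, and xml_to_records, TYPE_HANDLERS and the two-entry SAMPLE (including the second entry's values) are hypothetical and made up for illustration.

```python
import datetime
import xml.etree.ElementTree as ET

M_NS = "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
M_NULL = "{%s}null" % M_NS
M_TYPE = "{%s}type" % M_NS

# Converters keyed by the m:type attribute, mirroring type_handlers above.
TYPE_HANDLERS = {
    "Edm.Double": float,
    "Edm.Int32": int,
    "Edm.DateTime": lambda s: datetime.datetime.strptime(s, "%Y-%m-%dT%H:%M:%S"),
}

def xml_to_records(xml_text):
    """Return one dict per m:properties element found in xml_text."""
    root = ET.fromstring(xml_text)
    records = []
    for props in root.iter("{%s}properties" % M_NS):
        record = {}
        for child in props:
            name = child.tag.split("}", 1)[-1]  # drop the namespace part of the tag
            if child.get(M_NULL) == "true":
                record[name] = None
            else:
                handler = TYPE_HANDLERS.get(child.get(M_TYPE))
                # Untyped properties are kept as plain text.
                record[name] = handler(child.text) if handler else child.text
        records.append(record)
    return records

# Hypothetical two-entry feed, trimmed to three properties per entry.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <entry><content><m:properties>
    <d:Id m:type="Edm.Int32">1</d:Id>
    <d:NEW_DATE m:type="Edm.DateTime">1997-01-02T00:00:00</d:NEW_DATE>
    <d:BC_1MONTH m:type="Edm.Double" m:null="true" />
  </m:properties></content></entry>
  <entry><content><m:properties>
    <d:Id m:type="Edm.Int32">2</d:Id>
    <d:NEW_DATE m:type="Edm.DateTime">1997-01-03T00:00:00</d:NEW_DATE>
    <d:BC_1MONTH m:type="Edm.Double">5.25</d:BC_1MONTH>
  </m:properties></content></entry>
</feed>"""

records = xml_to_records(SAMPLE)
```

With pandas installed, pd.DataFrame(xml_to_records(text)) should then give one row per entry, with the None values treated as missing data.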
