Robert Schoellkopf
Robert Schoellkopf

Reputation: 9

How to parse XML data into a list in Python

I'm trying to take an API call response and parse the XML data into list, but I am struggling with the multiple child/parent relationships.

My hope is to export a new XML file that would line up each job ID and tracking number, which I could then import into Excel.


Here is what I have so far

The source XML file looks like this:

<project>
   <name>October 2019</name>
   <jobs>
      <job>
      <id>5654206</id>
      <tracking>
         <mailPiece>
             <barCode>00270200802095682022</barCode>
             <address>Accounts Payable,1661 Knott Ave,La Mirada,CA,90638</address>
             <status>En Route</status>
             <dateTime>2019-10-12 00:04:21.0</dateTime>
             <statusLocation>PONTIAC,MI</statusLocation>
        </mailPiece>
     </tracking>...

Code:

import xml.etree.ElementTree as ET
from xml.etree.ElementTree import Element, SubElement

tree = ET.parse('mailings.xml')
root = tree.getroot()
print(root.tag)

for x in root[1].findall('job'):
    id=x.find('id').text
    tracking=x.find('tracking').text
    print(root[1].tag,id,tracking)

The script currently returns the following:

jobs 5654206 None
jobs 5654203 None

Upvotes: 0

Views: 2084

Answers (1)

LightCC
LightCC

Reputation: 11639

Debugging is your friend...

I am struggling with the multiple child/parent relationships.

The right way to resolve this yourself is through using a debugger. For example, with VS Code, after applying a breakpoint and running the script with the debugger, it will stop at the breakpoint and I can inspect all the variables in memory, and run commands at the debug console just as if they were in my script. The Variable windows output looks like this:

enter image description here

There are various ways to do this at the command-line, or with a REPL like iPython, etc., but I find using debugging in a modern IDE environment like VS Code or PyCharm are definitely the way to go. Their debuggers remove the need to pepper print statements everywhere to test out your code, rewriting your code to expose more variables that must be printed to the console, etc.

A debugger allows you to see all the variables as a snapshot, exactly how the Python interpreter sees them, at any point in your code execution. You can:

  • step through your code line-by-line and watch the variables changes in real-time in the window
  • setup a separate watch window with only the variables you care about
  • and setup breakpoints that will only happen if variables are set to particular values, etc.

Child Hierarchy with the XML find method

Inspecting the variables as I step through your code, it appears that the find() method was walking the children within an Element at all levels, not just at the top level. When you used x.find('tracking') it is finding the mailPiece nodes directly. If you print the tag property, instead of the text property, you will see it is 'mailPiece' (see the debug windows above).

So, one way to resolve your issue is to store each mailPiece element as a variable, then pull out the individual attributes you want from it (i.e. BarCode, address, etc.) using find.

Here is some code that pulls all of this into a combined hierarchy of lists and dictionaries that you can then use to build your Excel outputs.

Note: The most efficient way to do this is line-by-line as you read the xml, but this is better for readability, maintainability, and if you need to do any post-processing that requires knowledge of more than one node at a time.

import xml.etree.ElementTree as ET
from xml.etree.ElementTree import Element, SubElement
from types import SimpleNamespace

tree = ET.parse('mailings.xml')
root = tree.getroot()

jobs = []
for job in root[1].findall('job'):
    jobdict = {}
    jobdict['id'] = job.find('id').text
    jobdict['trackingMailPieces'] = []
    for tracking in job.find('tracking'):
        if tracking.tag == 'mailPiece':
            mailPieceDict = {}
            mailPieceDict['barCode'] = tracking.find('barCode').text
            mailPieceDict['address'] = tracking.find('address').text
            mailPieceDict['status'] = tracking.find('status').text
            mailPieceDict['dateTime'] = tracking.find('dateTime').text
            mailPieceDict['statusLocation'] = tracking.find('statusLocation').text
            jobdict['trackingMailPieces'].append(mailPieceDict)
    jobs.append(jobdict)

for job in jobs:
    print('Job ID: {}'.format(job['id']))
    for mp in job['trackingMailPieces']:
        print('  mailPiece:')
        for key, value in mp.items():
            print('    {} = {}'.format(key, value))

The result is:

Job ID: 5654206
  mailPiece:
    barCode = 00270200802095682022
    address = Accounts Payable,1661 Knott Ave,La Mirada,CA,90638
    status = En Route
    dateTime = 2019-10-12 00:04:21.0
    statusLocation = PONTIAC,MI

Output?

I didn't address what to do with the output as that is beyond the scope of this question, but consider writing out to a CSV file, or even directly to an Excel file, if you don't need to pass on the XML to another program for some reason. There are Python packages that handle writing CSV and Excel files.

No need to create an intermediate format that you then need to manipulate after bringing it into Excel, for example.

Upvotes: 1

Related Questions