DarthOpto

Reputation: 1652

Using BeautifulSoup to Parse XML to a Dictionary

I have a piece of XML that looks like this:

 <ns:Vehicle>
  <ns:Model>AVALON</ns:Model>
  <ns:ModelYear>1998</ns:ModelYear>
  <ns:MakeString>TY</ns:MakeString>
  <ns:VehicleID>VIN NUMBER GOES HERE</ns:VehicleID>
 </ns:Vehicle>

I have the following code to make the vehicle element into a dictionary:

from bs4 import BeautifulSoup

with open('6046179.xml') as xml_file:
    soup = BeautifulSoup(xml_file)

# Vehicle elements
el_model = soup.find('ns:model').text
el_model_year = soup.find('ns:modelyear').text
el_make_string = soup.find('ns:makestring').text
el_vehicle_id = soup.find('ns:vehicleid').text

vehicle = {'model': '{}'.format(el_model),
           'model_year': '{}'.format(el_model_year),
           'make_string': '{}'.format(el_make_string),
           'vehicle_id': '{}'.format(el_vehicle_id)}

print(vehicle)

Is there a better way to do this? I don't mind going through the rest of the elements in the XML and defining them individually like this, but I would like to know if there is a cleaner approach.

Upvotes: 1

Views: 7076

Answers (3)

declension

Reputation: 4185

BeautifulSoup is not really meant for XML - it's ideal for messy HTML that would break a proper parser.

You're much better off using the ElementTree interface (from the standard library or, perhaps, the very fast lxml, which BS itself can use as its parser when it is installed). Then you get the root element and iterate over all its children in a few lines of code, e.g.:

#!/usr/bin/env python

import xml.etree.ElementTree as ET
import re

# Note the dummy namespace that must / should have been there...
xml = '''
 <ns:Vehicle xmlns:ns="http://foo.bar">
  <ns:Model>AVALON</ns:Model>
  <ns:ModelYear>1998</ns:ModelYear>
  <ns:MakeString>TY</ns:MakeString>
  <ns:VehicleID>VIN NUMBER GOES HERE</ns:VehicleID>
 </ns:Vehicle>'''

tree = ET.fromstring(xml)
vehicle = {re.sub(r'{.*}', '', node.tag): node.text for node in tree}
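
If the file holds several vehicles, the same idea extends to a list of dictionaries. The following is only a sketch: the wrapping `ns:Vehicles` element, the second vehicle, and the helper names `local_name` and `parse_vehicles` are made up for illustration, and the namespace URI is the same dummy value as above.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map for findall(); the URI is the dummy one used above.
NS = {'ns': 'http://foo.bar'}

# Hypothetical document: a wrapper element holding two vehicles.
XML = '''<ns:Vehicles xmlns:ns="http://foo.bar">
 <ns:Vehicle>
  <ns:Model>AVALON</ns:Model>
  <ns:ModelYear>1998</ns:ModelYear>
 </ns:Vehicle>
 <ns:Vehicle>
  <ns:Model>CAMRY</ns:Model>
  <ns:ModelYear>2001</ns:ModelYear>
 </ns:Vehicle>
</ns:Vehicles>'''


def local_name(tag):
    """Drop the '{uri}' part that ElementTree puts in front of tag names."""
    return tag.rpartition('}')[2]


def parse_vehicles(xml_text):
    """Return one dict per ns:Vehicle child of the root element."""
    root = ET.fromstring(xml_text)
    return [
        {local_name(child.tag): child.text for child in vehicle}
        for vehicle in root.findall('ns:Vehicle', NS)
    ]


vehicles = parse_vehicles(XML)
print(vehicles)
```

Using `rpartition` instead of a regular expression is just a matter of taste here; both strip the namespace prefix that ElementTree adds.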

Upvotes: 1

senshin

Reputation: 10360

If you want all the child nodes inside the <ns:Vehicle> element, you don't need to explicitly specify them - just loop over all the child elements and put them in a dictionary.

from bs4 import BeautifulSoup

soup = BeautifulSoup('''
 <ns:Vehicle>
  <ns:Model>AVALON</ns:Model>
  <ns:ModelYear>1998</ns:ModelYear>
  <ns:MakeString>TY</ns:MakeString>
  <ns:VehicleID>VIN NUMBER GOES HERE</ns:VehicleID>
 </ns:Vehicle>
 ''')

# loop if you have multiple vehicles
# Note that BS normalizes all tag names to lowercase -> we use 'ns:vehicle' rather than 'ns:Vehicle'
for el_vehicle in soup.find_all('ns:vehicle'): 
    vehicle = {child.name: child.text for child in el_vehicle.findChildren()}
    # stick `vehicle` in a list or do some other processing

This doesn't exactly match your output since it doesn't convert from camelcase to underscore-separated names (e.g. ModelYear to model_year), and it also doesn't strip the namespace off the element names. If you need that, it shouldn't be too difficult to include a wrapper around child.name to change the name accordingly.
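
A wrapper of that kind could look roughly like the sketch below. The function name `normalize_tag_name` is hypothetical. One caveat: the camel-case split only works if the original capitalisation is still there, and BS's HTML parsers lowercase tag names, so this assumes the names come from a case-preserving parser (for example `BeautifulSoup(markup, 'xml')`, which needs lxml). The example just runs on plain strings.

```python
import re


def normalize_tag_name(name):
    """Turn a name like 'ns:ModelYear' into 'model_year'."""
    # Strip an optional 'prefix:' namespace part.
    local = name.split(':', 1)[-1]
    # Insert '_' before a capitalised word, e.g. 'ModelYear' -> 'Model_Year'.
    partly = re.sub(r'(.)([A-Z][a-z]+)', r'\1_\2', local)
    # Insert '_' between a lowercase letter or digit and an uppercase one,
    # e.g. 'VehicleID' -> 'Vehicle_ID', then lowercase everything.
    return re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', partly).lower()


names = ['ns:Model', 'ns:ModelYear', 'ns:MakeString', 'ns:VehicleID']
keys = [normalize_tag_name(n) for n in names]
print(keys)  # ['model', 'model_year', 'make_string', 'vehicle_id']
```

In the loop above, the dictionary key would then become `normalize_tag_name(child.name)` instead of `child.name`.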

Upvotes: 1

Curtis Mattoon

Reputation: 4742

Might be a bit cleaner, but essentially no different:

tags = ['model', 'modelyear', 'makestring', 'vehicleid']
vehicle = {}
for tag in tags:
    vehicle[tag] = '{}'.format(soup.find('ns:' + tag).text)

There's also xmltodict, which might be worth a look.

Upvotes: 3
