Joe

Reputation: 465

XML not parsing as expected with BeautifulSoup

I am trying to parse XML from a website. I have no control over the content, so I cannot fix it at the source if it is not formatted properly. A very simplified example of the XML data is below.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<items:items itemId="1">
    <parameter name="param1" value="A"/>
    <parameter name="param2" value="B"/>
    <product productid="test1">
        <parameter name="prodinfo1" value="Q"/>
        <parameter name="prodinfo2" value="R"/>
    </product>
    <product productid="test2">
        <parameter name="prodinfo1" value="S"/>
        <parameter name="prodinfo2" value="T"/>
    </product>
</items:items>
<items:items itemId="2">
    <parameter name="param1" value="C"/>
    <parameter name="param2" value="D"/>
    <product productid="test3">
        <parameter name="prodinfo1" value="U"/>
        <parameter name="prodinfo2" value="V"/>
    </product>
    <product productid="test4">
        <parameter name="prodinfo1" value="W"/>
        <parameter name="prodinfo2" value="X"/>
    </product>
</items:items>

I wrote a short Python 2.7 script using BeautifulSoup 3.2.1 to parse the XML (I am constrained to using these versions, so unfortunately upgrading is not an option).

from BeautifulSoup import BeautifulStoneSoup

def main():
    fieldList = ('param1','param2')
    prodFieldList = ('prodinfo1','prodinfo2')
    xmlfile = 'test.xml'
    xmldata = open(xmlfile).read()
    soup = BeautifulStoneSoup(xmldata)
    print soup.prettify()

    for message in soup.findAll('items:items', recursive=False):
        report = {}
        for field in fieldList:
            report[field] = '{}'.format(message.find(attrs={"name" : field})['value'])
        for product in message.findAll('product', recursive=False):
            prodreport = {}
            for field in prodFieldList:
                prodreport[field] = '{}'.format(product.find(attrs={"name" : field})['value'])

if __name__ == "__main__":
    main()

For some reason, the parameters within <product></product>, such as prodinfo1 and prodinfo2, do not show up. When I look at the output from soup.prettify(), the nesting differs from my XML file above: the product parameters are listed outside the <product></product> tags, so their association with a particular product is lost:

<?xml version='1.0' encoding='utf-8'?>
<items:items itemid="1">
 <parameter name="param1" value="A">
 </parameter>
 <parameter name="param2" value="B">
  <product productid="test1">
  </product>
 </parameter>
 <parameter name="prodinfo1" value="Q">
 </parameter>
 <parameter name="prodinfo2" value="R">
  <product productid="test2">
  </product>
 </parameter>
 <parameter name="prodinfo1" value="S">
 </parameter>
 <parameter name="prodinfo2" value="T">
 </parameter>
</items:items>
<items:items itemid="2">
 <parameter name="param1" value="C">
 </parameter>
 <parameter name="param2" value="D">
  <product productid="test3">
  </product>
 </parameter>
 <parameter name="prodinfo1" value="U">
 </parameter>
 <parameter name="prodinfo2" value="V">
  <product productid="test4">
  </product>
 </parameter>
 <parameter name="prodinfo1" value="W">
 </parameter>
 <parameter name="prodinfo2" value="X">
 </parameter>
</items:items>

I have been searching but haven’t found anyone with the same problem. Why is this happening, and what can I do to properly parse this XML? Thank you for your time.

Upvotes: 2

Views: 952

Answers (1)

Logan

Reputation: 56

It works for me after making 3 changes:

  • 0) I'm using bs4 (this is the only version I have installed)

  • 1) BeautifulSoup(xmldata, features="xml") instead of BeautifulStoneSoup(xmldata). BeautifulStoneSoup is deprecated in bs4.

  • 2) I changed soup.findAll('items:items', recursive=False) to soup.findAll(True, {"itemId":True}, recursive=False)

    from bs4 import BeautifulSoup
    
    xmldata = open('test.xml').read()  # load your data (test.xml is the file name used in the question)
    
    if __name__ == "__main__":
    
        fieldList = ('param1','param2')
        prodFieldList = ('prodinfo1','prodinfo2')
        soup = BeautifulSoup(xmldata, features="xml")# <- notice this
        print soup.prettify(), "\n"
    
        for message in soup.findAll(True, {"itemId":True}, recursive=False):# <- and this
    
            report = {}
            for field in fieldList:
                report[field] = '{}'.format(message.find(attrs={"name" : field})['value'])
                print report
    
            for product in message.findAll('product', recursive=False):
                prodreport = {}
                for field in prodFieldList:
                    prodreport[field] = '{}'.format(product.find(attrs={"name" : field})['value'])
                print prodreport
    

output:

    <?xml version="1.0" encoding="utf-8"?>
    <items itemId="1">
     <parameter name="param1" value="A"/>
     <parameter name="param2" value="B"/>
     <product productid="test1">
      <parameter name="prodinfo1" value="Q"/>
      <parameter name="prodinfo2" value="R"/>
     </product>
     <product productid="test2">
      <parameter name="prodinfo1" value="S"/>
      <parameter name="prodinfo2" value="T"/>
     </product>
    </items> 

    {'param1': 'A'}
    {'param2': 'B', 'param1': 'A'}
    {'prodinfo1': 'Q', 'prodinfo2': 'R'}
    {'prodinfo1': 'S', 'prodinfo2': 'T'}
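
Note that the output above only contains the first items element. The sample has two top-level elements and an undeclared items: prefix, so it is not a well-formed XML document. If installing bs4 is not possible, one alternative (not what I ran above) is the standard library's xml.etree.ElementTree. The following is only a rough sketch under some assumptions: parse_items, SAMPLE and the namespace URI urn:example:items are names I made up for illustration, and wrapping the data in a synthetic root element is a workaround, not something the feed itself provides.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical namespace URI: the data never declares the "items" prefix,
# so one is invented here purely so the wrapper element can bind it.
ITEMS_NS = "urn:example:items"

# The sample data from the question.
SAMPLE = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<items:items itemId="1">
    <parameter name="param1" value="A"/>
    <parameter name="param2" value="B"/>
    <product productid="test1">
        <parameter name="prodinfo1" value="Q"/>
        <parameter name="prodinfo2" value="R"/>
    </product>
    <product productid="test2">
        <parameter name="prodinfo1" value="S"/>
        <parameter name="prodinfo2" value="T"/>
    </product>
</items:items>
<items:items itemId="2">
    <parameter name="param1" value="C"/>
    <parameter name="param2" value="D"/>
    <product productid="test3">
        <parameter name="prodinfo1" value="U"/>
        <parameter name="prodinfo2" value="V"/>
    </product>
    <product productid="test4">
        <parameter name="prodinfo1" value="W"/>
        <parameter name="prodinfo2" value="X"/>
    </product>
</items:items>
"""


def parse_items(xmldata):
    """Return one dict per top-level <items:items> element."""
    # Drop the XML declaration; it is only legal at the very start of a document.
    body = re.sub(r"^\s*<\?xml[^>]*\?>", "", xmldata)
    # Wrap the fragments in a synthetic root that binds the "items" prefix.
    wrapped = '<root xmlns:items="%s">%s</root>' % (ITEMS_NS, body)
    root = ET.fromstring(wrapped)

    reports = []
    for item in root.findall("{%s}items" % ITEMS_NS):
        report = {"itemId": item.get("itemId"), "products": []}
        # findall with a bare tag name matches direct children only,
        # so these are the parameters of the item itself.
        for param in item.findall("parameter"):
            report[param.get("name")] = param.get("value")
        # Each <product> keeps its own nested parameters.
        for product in item.findall("product"):
            prodreport = {"productid": product.get("productid")}
            for param in product.findall("parameter"):
                prodreport[param.get("name")] = param.get("value")
            report["products"].append(prodreport)
        reports.append(report)
    return reports


for report in parse_items(SAMPLE):
    print(report["itemId"], [p["productid"] for p in report["products"]])
```

Because the wrapper makes the whole thing a single document, both items elements are visited, and each product dict stays attached to the item it came from.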

Upvotes: 2
