Reputation: 465
I am trying to parse XML from a website. I have no control over the content, so I cannot fix it at the source if it is not formatted properly. A very simplified example of the XML data is below.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<items:items itemId="1">
  <parameter name="param1" value="A"/>
  <parameter name="param2" value="B"/>
  <product productid="test1">
    <parameter name="prodinfo1" value="Q"/>
    <parameter name="prodinfo2" value="R"/>
  </product>
  <product productid="test2">
    <parameter name="prodinfo1" value="S"/>
    <parameter name="prodinfo2" value="T"/>
  </product>
</items:items>
<items:items itemId="2">
  <parameter name="param1" value="C"/>
  <parameter name="param2" value="D"/>
  <product productid="test3">
    <parameter name="prodinfo1" value="U"/>
    <parameter name="prodinfo2" value="V"/>
  </product>
  <product productid="test4">
    <parameter name="prodinfo1" value="W"/>
    <parameter name="prodinfo2" value="X"/>
  </product>
</items:items>
I wrote a short Python 2.7 script using BeautifulSoup 3.2.1 to parse the XML (I am constrained to using these versions, so unfortunately upgrading is not an option).
from BeautifulSoup import BeautifulStoneSoup

def main():
    fieldList = ('param1','param2')
    prodFieldList = ('prodinfo1','prodinfo2')
    xmlfile = 'test.xml'
    xmldata = open(xmlfile).read()
    soup = BeautifulStoneSoup(xmldata)
    print soup.prettify()
    for message in soup.findAll('items:items', recursive=False):
        report = {}
        for field in fieldList:
            report[field] = '{}'.format(message.find(attrs={"name" : field})['value'])
        for product in message.findAll('product', recursive=False):
            prodreport = {}
            for field in prodFieldList:
                prodreport[field] = '{}'.format(product.find(attrs={"name" : field})['value'])

if __name__ == "__main__":
    main()
For some reason, the parameters within <product></product>, such as prodinfo1 and prodinfo2, do not show up. When I look at the output from soup.prettify(), rather than being nested as displayed in my XML file above, the product parameters are listed outside the <product></product> tags, and thus their association with a particular product is lost:
<?xml version='1.0' encoding='utf-8'?>
<items:items itemid="1">
 <parameter name="param1" value="A">
 </parameter>
 <parameter name="param2" value="B">
  <product productid="test1">
  </product>
 </parameter>
 <parameter name="prodinfo1" value="Q">
 </parameter>
 <parameter name="prodinfo2" value="R">
  <product productid="test2">
  </product>
 </parameter>
 <parameter name="prodinfo1" value="S">
 </parameter>
 <parameter name="prodinfo2" value="T">
 </parameter>
</items:items>
<items:items itemid="2">
 <parameter name="param1" value="C">
 </parameter>
 <parameter name="param2" value="D">
  <product productid="test3">
  </product>
 </parameter>
 <parameter name="prodinfo1" value="U">
 </parameter>
 <parameter name="prodinfo2" value="V">
  <product productid="test4">
  </product>
 </parameter>
 <parameter name="prodinfo1" value="W">
 </parameter>
 <parameter name="prodinfo2" value="X">
 </parameter>
</items:items>
I have been searching but haven’t found anyone with the same problem. Why is this happening, and what can I do to properly parse this XML? Thank you for your time.
Upvotes: 2
Views: 952
Reputation: 56
It works for me after making 3 changes:
0) I'm using bs4 (this is the only version I have installed)
1) BeautifulSoup(xmldata, features="xml") instead of BeautifulStoneSoup(xmldata); BeautifulStoneSoup is deprecated in bs4
2) I changed soup.findAll('items:items', recursive=False) to soup.findAll(True, {"itemId":True}, recursive=False)
from bs4 import BeautifulSoup

xmldata = open('test.xml').read()  # load your data

if __name__ == "__main__":
    fieldList = ('param1','param2')
    prodFieldList = ('prodinfo1','prodinfo2')
    soup = BeautifulSoup(xmldata, features="xml")  # <- notice this
    print soup.prettify(), "\n"
    for message in soup.findAll(True, {"itemId":True}, recursive=False):  # <- and this
        report = {}
        for field in fieldList:
            report[field] = '{}'.format(message.find(attrs={"name" : field})['value'])
            print report
        for product in message.findAll('product', recursive=False):
            prodreport = {}
            for field in prodFieldList:
                prodreport[field] = '{}'.format(product.find(attrs={"name" : field})['value'])
            print prodreport
output:
<?xml version="1.0" encoding="utf-8"?>
<items itemId="1">
 <parameter name="param1" value="A"/>
 <parameter name="param2" value="B"/>
 <product productid="test1">
  <parameter name="prodinfo1" value="Q"/>
  <parameter name="prodinfo2" value="R"/>
 </product>
 <product productid="test2">
  <parameter name="prodinfo1" value="S"/>
  <parameter name="prodinfo2" value="T"/>
 </product>
</items>
{'param1': 'A'}
{'param2': 'B', 'param1': 'A'}
{'prodinfo1': 'Q', 'prodinfo2': 'R'}
{'prodinfo1': 'S', 'prodinfo2': 'T'}
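Note that the output above only contains the itemId="1" element, most likely because an XML document is allowed only one root element and the sample has two. If bs4 is not available, one possible workaround is sketched below using the standard library's xml.etree.ElementTree instead of BeautifulSoup (a swapped-in technique, not what the code above does). It assumes the feed looks like the sample in the question: it strips the XML declaration and wraps everything in a synthetic root that declares the otherwise undeclared items prefix. The names ITEMS_NS, read_params and parse_items, the placeholder namespace URI, and the shortened SAMPLE string are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Placeholder URI (an assumption): the feed never declares the "items" prefix,
# so any URI works as long as it is used consistently below.
ITEMS_NS = "urn:example:items"

# Shortened copy of the sample from the question.
SAMPLE = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<items:items itemId="1">
  <parameter name="param1" value="A"/>
  <parameter name="param2" value="B"/>
  <product productid="test1">
    <parameter name="prodinfo1" value="Q"/>
    <parameter name="prodinfo2" value="R"/>
  </product>
</items:items>
<items:items itemId="2">
  <parameter name="param1" value="C"/>
  <parameter name="param2" value="D"/>
  <product productid="test3">
    <parameter name="prodinfo1" value="U"/>
    <parameter name="prodinfo2" value="V"/>
  </product>
</items:items>
"""


def read_params(element, names):
    """Map each requested name to the value of the matching direct <parameter> child."""
    found = dict((p.get("name"), p.get("value")) for p in element.findall("parameter"))
    return dict((name, found.get(name)) for name in names)


def parse_items(xmldata, fields=("param1", "param2"),
                prod_fields=("prodinfo1", "prodinfo2")):
    body = xmldata.strip()
    # The XML declaration may only appear at the very start, so drop it before wrapping.
    if body.startswith("<?xml"):
        body = body[body.index("?>") + 2:]
    # Wrap the sibling <items:items> elements in one root that declares the prefix.
    wrapped = '<root xmlns:items="%s">%s</root>' % (ITEMS_NS, body)
    root = ET.fromstring(wrapped)

    results = []
    for message in root.findall("{%s}items" % ITEMS_NS):
        report = read_params(message, fields)
        products = []
        # findall with a plain tag name only matches direct children,
        # so each product keeps its own parameters.
        for product in message.findall("product"):
            prodreport = read_params(product, prod_fields)
            prodreport["productid"] = product.get("productid")
            products.append(prodreport)
        results.append((message.get("itemId"), report, products))
    return results


if __name__ == "__main__":
    for item in parse_items(SAMPLE):
        print(item)
```

This sketch avoids print statements with multiple arguments and str.format, so it should run on Python 2.7 as well as Python 3. It will still fail on input that is malformed in other ways, since ElementTree is a strict parser.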
Upvotes: 2