Reputation: 2234
I am trying to parse an XML file using BeautifulSoup, but the output is not what I expect.
file.xml:
<?xml version="1.0" ?>
<opening name="value1" >
<element name="value1.1"/>
<element name="value1.2">
<element name="1.2.1"/>
</element>
<element name="value1.3">
<element name="value1.3.1"/>
</element>
</opening>
I am using the following code:
>>> a=open('file.xml').read()
>>> import BeautifulSoup
>>> s= BeautifulSoup.BeautifulSoup(a)
>>> print s.prettify()
and I get the following output:
<?xml version='1.0' encoding='utf-8'?>
<opening name="value1">
<element name="value1.1">
</element>
<element name="value1.2">
</element>
<element name="1.2.1">
</element>
<element name="value1.3">
</element>
<element name="value1.3.1">
</element>
</opening>
Why does it show all the elements as direct children of the opening tag? How do I parse this file properly?
I've also tried s = BeautifulSoup.BeautifulStoneSoup(a), but that didn't work either.
Upvotes: 0
Views: 124
Reputation: 73
Beautiful Soup 3 does not know which XML tags are self-closing unless you tell it, so it needs a special argument to close tags properly. Pass the selfClosingTags argument to the BeautifulStoneSoup constructor. Use something like:
soup = BeautifulStoneSoup(markup, selfClosingTags=['element'])
Upvotes: 0
Reputation: 142106
BeautifulSoup is primarily an HTML parser that tries its best to deal with malformed HTML. There are dedicated XML libraries out there, such as lxml, which I highly recommend. Try that.
An example:
import lxml.etree
xml = """<?xml version="1.0" ?>
<opening name="value1" >
<element name="value1.1"/>
<element name="value1.2">
<element name="1.2.1"/>
</element>
<element name="value1.3">
<element name="value1.3.1"/>
</element>
</opening>
"""
r = lxml.etree.fromstring(xml)
r.xpath('//element/@name')
# ['value1.1', 'value1.2', '1.2.1', 'value1.3', 'value1.3.1']
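To check that a real XML parser keeps the nesting from the question's file (rather than flattening every element under opening), here is a minimal sketch. It assumes the standard library's xml.etree.ElementTree in place of lxml so it runs without extra installs; the outline helper is a hypothetical name introduced only for illustration.

```python
import xml.etree.ElementTree as ET

xml = """<?xml version="1.0" ?>
<opening name="value1" >
<element name="value1.1"/>
<element name="value1.2">
<element name="1.2.1"/>
</element>
<element name="value1.3">
<element name="value1.3.1"/>
</element>
</opening>
"""


def outline(node, depth=0):
    """Return (depth, name) pairs for node and its descendants, in document order.

    Hypothetical helper for illustration; not part of any library.
    """
    rows = [(depth, node.get("name"))]
    for child in node:  # iterating an Element yields its direct children
        rows.extend(outline(child, depth + 1))
    return rows


root = ET.fromstring(xml)
tree_rows = outline(root)

# Print each name indented by its nesting depth.
for depth, name in tree_rows:
    print("  " * depth + name)
```

The depth recorded for 1.2.1 and value1.3.1 is 2, so they remain children of value1.2 and value1.3 respectively. With lxml, the same recursion should work on the result of lxml.etree.fromstring, since its elements support the same iteration and get calls.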
Upvotes: 1