Reputation: 46463
When using BeautifulSoup for XML:
import bs4
soup = bs4.BeautifulSoup('<?xml version="1.0" encoding="utf-8"?><mydocument><b></b></mydocument>', 'lxml')
# add or remove tags in soup
print(soup)
the output has unnecessary <html> and <body> elements:
<?xml version="1.0" encoding="utf-8"?><html><body><mydocument><b></b></mydocument></body></html>
How can I avoid these HTML-specific elements and output XML with BeautifulSoup?
This is not a valid solution:
print(soup.find('mydocument'))
because it removes the <?xml version="1.0" encoding="utf-8"?> declaration, which I want to keep.
Upvotes: 1
Views: 50
Reputation: 24930
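The "lxml" parser is an HTML parser, which is why it wraps the document in <html> and <body>. Use an XML parser instead, with either of these: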
my_xml = '<?xml version="1.0" encoding="utf-8"?><mydocument><b></b></mydocument>'
soup = bs4.BeautifulSoup(my_xml, "xml")
or
soup = bs4.BeautifulSoup(my_xml, "lxml-xml")
In either case, print(soup) should output:
<?xml version="1.0" encoding="utf-8"?>
<mydocument><b/></mydocument>
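To cover the "add or remove tags" step from the question, here is a minimal sketch, assuming lxml is installed alongside bs4. The <c> tag and its text are made up for illustration. It shows that edits made to a tree parsed with "xml" are serialized without the HTML wrapper and with the XML declaration intact:

```python
import bs4

my_xml = '<?xml version="1.0" encoding="utf-8"?><mydocument><b></b></mydocument>'

# "xml" selects the XML parser (this needs lxml installed)
soup = bs4.BeautifulSoup(my_xml, "xml")

# Add a hypothetical <c> tag to show that edits work on the XML tree
new_tag = soup.new_tag("c")
new_tag.string = "hello"
soup.mydocument.append(new_tag)

# Serialize the whole document, declaration included
result = str(soup)
print(result)
```

The printed document should still start with the XML declaration and contain no <html> or <body> elements.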
Upvotes: 1