Basj

Reputation: 46463

BeautifulSoup as XML parser produces an unwanted html/body

When using BeautifulSoup for XML:

import bs4
soup = bs4.BeautifulSoup('<?xml version="1.0" encoding="utf-8"?><mydocument><b></b></mydocument>', 'lxml')
# add or remove tags in soup
print(soup)

the output has an unnecessary <html> and <body>:

<?xml version="1.0" encoding="utf-8"?><html><body><mydocument><b></b></mydocument></body></html>

How can I avoid these HTML-specific elements and output plain XML with BeautifulSoup?

This is not a valid solution:

print(soup.find('mydocument'))

because it removes the <?xml version="1.0" encoding="utf-8"?>, which I want to keep.

Upvotes: 1

Views: 50

Answers (1)

Jack Fleeting

Reputation: 24930

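The "lxml" parser name selects lxml's HTML parser, which wraps whatever it parses in <html> and <body>. To parse the input as XML, pass "xml" or "lxml-xml" instead; both use lxml's XML parser. Try one of these: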

import bs4

my_xml = '<?xml version="1.0" encoding="utf-8"?><mydocument><b></b></mydocument>'
soup = bs4.BeautifulSoup(my_xml, "xml")

or

soup = bs4.BeautifulSoup(my_xml, "lxml-xml")

In either case, print(soup) should output:

<?xml version="1.0" encoding="utf-8"?>
<mydocument><b/></mydocument>
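
For a side-by-side check, here is a minimal sketch that parses the same string with the HTML parser from the question and the XML parser suggested here. It assumes bs4 and lxml are both installed; the render helper is a hypothetical name used only for this illustration.

```python
import bs4

my_xml = '<?xml version="1.0" encoding="utf-8"?><mydocument><b></b></mydocument>'


def render(markup: str, parser: str) -> str:
    """Parse markup with the named parser and return the serialized document.

    Hypothetical helper for this comparison; it is not part of bs4.
    """
    return str(bs4.BeautifulSoup(markup, parser))


# "lxml" is lxml's HTML parser: it adds <html> and <body> around the content.
# Recent bs4 versions may also emit a warning about XML being parsed as HTML.
html_out = render(my_xml, "lxml")

# "xml" (same as "lxml-xml") is lxml's XML parser: no HTML wrapper is added.
xml_out = render(my_xml, "xml")

print("<html>" in html_out)  # wrapper present with the HTML parser
print("<html>" in xml_out)   # wrapper absent with the XML parser
print(xml_out)
```

With the XML parser the declaration is kept at the top of the serialized output, so there is no need to fall back on soup.find('mydocument').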

Upvotes: 1
