Guillaume Lebourgeois

Reputation: 3873

How to convert raw HTML from the web into parsable XML in Python

I thought BeautifulSoup could do that, but it does not seem to do the trick.

What method have you already used, and is it reliable in the long term?

Upvotes: 2

Views: 301

Answers (2)

Scharron

Reputation: 17797

You can try http://utidylib.berlios.de/, a Python wrapper for the Tidy library. Tidy works well in most cases.

For something more robust (or at least more browser-like), I guess you could try WebKit or Gecko. I'm not sure whether wrappers exposing their HTML-cleaning code are available, but you can have a look.

Upvotes: 2

supakeen

Reputation: 2924

You could use the lxml library, specifically lxml.html, which gives you an ElementTree-style tree that you can then serialize as XML with (amongst others) the lxml.etree.tostring() function.
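
A minimal sketch of that approach, assuming lxml is installed. The helper name html_to_xml and the sample markup are made up for illustration; real pages may need extra care with encodings.

```python
from lxml import etree, html


def html_to_xml(raw_html: str) -> str:
    """Parse possibly malformed HTML and return it serialized as XML."""
    # lxml.html relies on libxml2's lenient HTML parser, which closes
    # unclosed tags and adds the missing <html>/<body> wrappers.
    root = html.document_fromstring(raw_html)
    # method="xml" emits well-formed markup (e.g. <br/> rather than <br>).
    return etree.tostring(root, method="xml", encoding="unicode")


# Deliberately sloppy input: unclosed <p> tags and a bare <br>.
raw = "<p>Hello<br>world<p>Second paragraph"
xml_text = html_to_xml(raw)

# Feed the result to a strict XML parser to confirm it is parsable.
reparsed = etree.fromstring(xml_text)
print(xml_text)
```

The strict re-parse at the end is only a sanity check; in practice you would keep working with the tree returned by lxml.html directly.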

If this fails on your HTML (because it is too broken), you can use ElementSoup (lxml's interface to the BeautifulSoup parser) to build an lxml.html tree.

Upvotes: 4
