zjm1126

Reputation: 66767

Best way to convert this HTML file into an XML file using Python

The HTML is here:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><META http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>

    <div bgcolor="#48486c">

        <table width="720" border="0" cellspacing="0" cellpadding="0" align="center" background="http://title.jpg" height="130">

            <tr height="129">

                <td width="719" height="129"></td>

                <td width="1" height="129"></td>

            </tr>

            <tr height="1">

                <td width="720" height="1"></td>

                <td width="1" height="1"></td>

            </tr>

        </table>

        <table width="720" border="0" cellspacing="0" cellpadding="0" align="center" height="203">

            <tr height="20">

                <td width="719" height="20"></td>

                <td width="1" height="20"></td>

            </tr>

            <tr height="69">

                <td width="719" height="69" valign="top" align="left">

                    <table width="719" border="1" cellspacing="2" cellpadding="0">

                        <tr>

                            <td bgcolor="a5fdf8" width="390"><b>Stream Name</b></td>

                            <td bgcolor="a5fdf8" width="61"><b>Status</b></td>

                            <td bgcolor="a5fdf8" width="61"><b>Duration</b></td>

                            <td bgcolor="a5fdf8" width="185"><b>Start</b></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="390">c:\streams\ours\Sony_AVCHD_<WBR>Test_Discs_60Hz_00001.m2ts</td>

                            <td width="61"><font color="#D0D0D0">----</font></td>

                            <td width="61">00:00:02</td>

                            <td width="185">2010/06/15-15:06:17</td>

                        </tr>

                    </table>

                </td>

                <td width="1" height="69"></td>

            </tr>

            <tr height="113">

                <td width="720" height="113" colspan="2" valign="top" align="left">

                    <table width="721" border="1" cellspacing="2" cellpadding="0">

                        <tr bgcolor="a5fdf8">

                            <td width="299"><b>Test Category</b></td>

                            <td width="61"><b>Error</b></td>

                            <td width="62"><b>Warning</b></td>

                            <td width="275"><b>Details</b></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#099eac">All Tests (Sony_AVCHD_Test_Discs_60Hz_<WBR>00001.m2ts)</font></td>

                            <td width="61"><font color="#ff0000">34787</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#800000">  ETSI TR-101-290 Tests</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#800000">  ISO/IEC Transport Stream Tests</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#800000">  System Data T-STD Tests</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#099eac">  Prog(1)</font></td>

                            <td width="61"><font color="#ff0000">34787</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#099eac">    VES(0xe0)</font></td>

                            <td width="61"><font color="#ff0000">34787</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#1010F0">      H.264/AVC Conformance</font></td>

                            <td width="61"><font color="#ff0000">34718</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275">

                                <a><font color="#ff0000">Sony_AVCHD_Test_Discs_60Hz_<WBR>00001.m2ts_Prog(1)_PID(0x1011)<WBR>_H264_Conf.txt</font></a><br>

                            </td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#101010">        Sequence</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#101010">        Picture</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#101010">        Slice</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#101010">        Macroblock</font></td>

                            <td width="61"><font color="#ff0000">34718</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#101010">        Block</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#1010F0">      HRD Tests</font></td>

                            <td width="61"><font color="#ff0000">69</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275">

                                <a><font color="#ff0000">Sony_AVCHD_Test_Discs_60Hz_<WBR>00001.m2ts_Prog(1)_PID(0x1011)<WBR>_H264_HRD.txt</font></a><br>

                            </td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#101010">        HRD level</font></td>

                            <td width="61"><font color="#ff0000">69</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#800000">      Video T-STD Tests</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#099eac">    AES(0xfd)</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#808080">      Audio Level Tests</font></td>

                            <td width="61"><font color="#808080">Disabled</font></td>

                            <td width="61"><font color="#808080">Disabled</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#800000">      Audio T-STD Tests</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="275"></td>

                        </tr>

                    </table>

                </td>

            </tr>

            <tr height="1">

                <td width="719" height="1"></td>

                <td width="1" height="1"></td>

            </tr>

        </table>

    </div>



</body></html>

Is there a Python library that can do this?

Thanks.

Upvotes: 10

Views: 26092

Answers (3)

wski

Reputation: 325

To piggyback off @Alex Martelli: the standard library ships an xml package, and since Python 2.5 it has included xml.etree.ElementTree:

https://docs.python.org/3.6/library/xml.html

You could strip the HTML tags, reformat the remaining content as XML, and use the built-in XML library instead of bringing in another dependency. This is only advisable if you trust the source of the XML, because the standard library parsers are not hardened against maliciously constructed input.
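As a rough sketch of that standard-library-only route (Python 3), the snippet below drops the presentational tags and keeps only the text of each table cell, grouped by row. The element names report, row and cell, the class name TableToXml and the function name html_tables_to_xml are made up for illustration. Rows of nested tables are flattened into the same list, so treat it as a starting point rather than a faithful conversion.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TableToXml(HTMLParser):
    """Collect the text of every <td> and group the cells by <tr>."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("report")  # hypothetical root element name
        self.row = None
        self.cell = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TD> and <td> look the same here.
        if tag == "tr":
            self.row = ET.SubElement(self.root, "row")
        elif tag == "td" and self.row is not None:
            self.cell = ET.SubElement(self.row, "cell")
            self.cell.text = ""

    def handle_endtag(self, tag):
        if tag == "td":
            self.cell = None

    def handle_data(self, data):
        # Text inside <b>, <font>, or around <WBR> ends up in the current cell.
        if self.cell is not None:
            self.cell.text += data


def html_tables_to_xml(markup):
    """Return an XML string holding the cell text found in markup."""
    parser = TableToXml()
    parser.feed(markup)
    parser.close()
    for cell in parser.root.iter("cell"):
        cell.text = (cell.text or "").strip()
    return ET.tostring(parser.root, encoding="unicode")
```

Calling html_tables_to_xml(open('a.html').read()) would then give a string that can be written to a file. ElementTree handles the escaping and the closing of elements, so the output is well-formed by construction.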

Upvotes: 0

Ian Bicking

Reputation: 9932

lxml works well:

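from lxml import html, etree

# Parse the (possibly malformed) HTML into an element tree
with open('a.html', 'rb') as f:
    doc = html.fromstring(f.read())

# Serialize the tree as well-formed XML; etree.tostring returns bytes
with open('a.xhtml', 'wb') as out:
    out.write(etree.tostring(doc))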
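To check that a converted file really is well-formed XML, one option is to run it through the strict parser in the standard library. This is only a sketch, and the function name is_well_formed is made up for illustration. It tests well-formedness only, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

For example, is_well_formed('<td>a<br/>b</td>') is True, while the HTML-style '<td>a<br>b</td>' is rejected because the br element is never closed.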

Upvotes: 11

Alex Martelli

Reputation: 882103

BeautifulSoup gets you almost all the way there:

>>> import BeautifulSoup
>>> f = open('a.html')
>>> soup = BeautifulSoup.BeautifulSoup(f)
>>> f.close()
>>> g = open('a.xml', 'w')
>>> print >> g, soup.prettify()
>>> g.close()

This closes all tags properly. The only remaining issue is that the doctype is still HTML. To change it to the doctype of your choice, you only need to replace the first line, which is not hard. E.g., instead of printing the prettified text directly:

>>> lines = soup.prettify().splitlines()
>>> lines[0] = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
...             '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">')
>>> print >> g, '\n'.join(lines)
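The same first-line replacement can be wrapped in a small helper that also works on Python 3. This is a minimal sketch: the names XHTML_DOCTYPE and replace_doctype are made up for illustration, and it assumes the doctype, if present, sits alone on the first line, as it does in prettified output.

```python
XHTML_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
)


def replace_doctype(markup, doctype=XHTML_DOCTYPE):
    """Swap the leading DOCTYPE line, or prepend one if none is present."""
    lines = markup.splitlines()
    if lines and lines[0].lstrip().upper().startswith("<!DOCTYPE"):
        lines[0] = doctype
    else:
        lines.insert(0, doctype)
    return "\n".join(lines)
```

Checking for a DOCTYPE before overwriting the first line means the root element is not lost when the input has no doctype at all.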

Upvotes: 13
