Reputation: 7679
Is encoding/xml the best library to parse HTML table files like this one and exist some examples how to do it?
<html><head>
<meta charset="utf-8">
</head>
<body>
<a name="Test1">
<center>
<b>Test 1</b> <table border="0">
<tbody><tr>
<th> Type </th>
<th> Region </th>
</tr>
<tr>
<td> <table border="0">
<thead>
<tr>
<th><b>Type</b></th>
<th> </th>
<th> Count </th>
<th> Percent </th>
</tr>
</thead>
<tbody><tr>
<td> <b>T1</b> </td>
<th> </th>
<td class="numeric" bgcolor="#ff0000"> 34,314 </td>
<td class="numeric" bgcolor="#ff0000"> 31.648% </td>
</tr>
<tr>
<td> <b>T2</b> </td>
<th> </th>
<td class="numeric" bgcolor="#bf3f00"> 25,820 </td>
<td class="numeric" bgcolor="#bf3f00"> 23.814% </td>
</tr>
<tr>
<td> <b>T3</b> </td>
<th> </th>
<td class="numeric" bgcolor="#24da00"> 4,871 </td>
<td class="numeric" bgcolor="#24da00"> 4.493% </td>
</tr>
</tbody></table><br>
</td>
<td> <table border="0">
<thead>
<tr>
<th><b> Type</b></th>
<th> </th>
<th> Count </th>
<th> Percent </th>
</tr>
</thead>
<tbody><tr>
<td> <b>T4</b> </td>
<th> </th>
<td class="numeric" bgcolor="#ff0000"> 34,314 </td>
<td class="numeric" bgcolor="#ff0000"> 31.648% </td>
</tr>
<tr>
<td> <b>T5</b> </td>
<th> </th>
<td class="numeric" bgcolor="#53ab00"> 11,187 </td>
<td class="numeric" bgcolor="#53ab00"> 10.318% </td>
</tr>
<tr>
<td> <b>T6</b> </td>
<th> </th>
<td class="numeric" bgcolor="#bf3f00"> 25,820 </td>
<td class="numeric" bgcolor="#bf3f00"> 23.814% </td>
</tr>
</tbody></table><br>
</td>
</tr>
</tbody></table>
</center>
</a>
</body></html>
Thank you in advance.
Upvotes: 1
Views: 503
Reputation: 55463
Depends on your HTML.
Strictly speaking, the only one kind of HTML which is guaranteed to be parsed by a conforming XML parser is XHTML, but despite the fact XHTML once has been thought of as coming to be the HTML standard, it has not really taken off the ground and these days it's considered obsolete (in favor of the much hyped "HTML5" thing and all the ecosystem around it). The basic problem with HTML is that while it looks like XML it has different rules. One glaring distinction is that <br>
is a perfectly legal HTML but is an unterminated element in XML (in the latter, it has to be spelled <br/>
), and there are a lot more differences.
On the other hand, your particular example looks quite XML'ish to me, so if you can guarantee your data, while being HTML, will always be a well-formed XML at the same time, you can just use the encoding/xml
package. Otherwise go for go.net/html
, as suggested by @elithrar, or find some other package.
Upvotes: 4