user977828

Reputation: 7679

Parsing HTML files with Go

Is encoding/xml the best library for parsing HTML files that contain tables like the one below, and are there any examples showing how to do it?

<html><head>
<meta charset="utf-8">

</head>
<body>
<a name="Test1">
<center>
<b>Test 1</b> <table border="0">
  <tbody><tr>
  <th> Type </th>
  <th> Region </th>
  </tr>
  <tr>
  <td> <table border="0">
  <thead>
  <tr>
    <th><b>Type</b></th>
    <th> &nbsp; </th>
    <th> Count </th>
    <th> Percent </th>
  </tr>
  </thead>
  <tbody><tr>
    <td> <b>T1</b> </td>
    <th> &nbsp; </th>
    <td class="numeric" bgcolor="#ff0000"> 34,314 </td>
    <td class="numeric" bgcolor="#ff0000"> 31.648% </td>
  </tr>
  <tr>
    <td> <b>T2</b> </td>
    <th> &nbsp; </th>
    <td class="numeric" bgcolor="#bf3f00"> 25,820 </td>
    <td class="numeric" bgcolor="#bf3f00"> 23.814% </td>
  </tr>
  <tr>
    <td> <b>T3</b> </td>
    <th> &nbsp; </th>
    <td class="numeric" bgcolor="#24da00"> 4,871 </td>
    <td class="numeric" bgcolor="#24da00"> 4.493% </td>
  </tr>

</tbody></table><br>
</td>
  <td> <table border="0">
  <thead>
  <tr>
    <th><b> Type</b></th>
    <th> &nbsp; </th>
    <th> Count </th>
    <th> Percent </th>
  </tr>
  </thead>
  <tbody><tr>
    <td> <b>T4</b> </td>
    <th> &nbsp; </th>
    <td class="numeric" bgcolor="#ff0000"> 34,314 </td>
    <td class="numeric" bgcolor="#ff0000"> 31.648% </td>
  </tr>
  <tr>
    <td> <b>T5</b> </td>
    <th> &nbsp; </th>
    <td class="numeric" bgcolor="#53ab00"> 11,187 </td>
    <td class="numeric" bgcolor="#53ab00"> 10.318% </td>
  </tr>
  <tr>
    <td> <b>T6</b> </td>
    <th> &nbsp; </th>
    <td class="numeric" bgcolor="#bf3f00"> 25,820 </td>
    <td class="numeric" bgcolor="#bf3f00"> 23.814% </td>
  </tr>

</tbody></table><br>
</td>
  </tr>
</tbody></table>
</center>

  </a>
</body></html>

Thank you in advance.

Upvotes: 1

Views: 503

Answers (1)

kostix

Reputation: 55463

Depends on your HTML.

Strictly speaking, the only kind of HTML that is guaranteed to be parsed by a conforming XML parser is XHTML. Although XHTML was once expected to become the HTML standard, it never really took off, and these days it is considered obsolete (in favor of the much-hyped "HTML5" and the ecosystem around it). The basic problem is that HTML looks like XML but follows different rules. One glaring difference is that <br> is perfectly legal HTML but an unterminated element in XML (where it has to be spelled <br/>), and there are many more differences.
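To make the <br> difference concrete, here is a minimal sketch that feeds both spellings to encoding/xml with its default (strict) settings. The helper name wellFormed is made up for this illustration; it simply drains the token stream and returns the first error it meets.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// wellFormed (a hypothetical helper) reads every token from s using the
// decoder's default strict mode and returns the first error, or nil if
// the whole input was consumed without complaint.
func wellFormed(s string) error {
	d := xml.NewDecoder(strings.NewReader(s))
	for {
		_, err := d.Token()
		if err == io.EOF {
			return nil // reached the end cleanly
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	// XML spelling: the element is explicitly self-closed.
	fmt.Println("<br/>:", wellFormed("<p>a<br/>b</p>"))
	// HTML spelling: <br> is never closed, so </p> does not match it.
	fmt.Println("<br>: ", wellFormed("<p>a<br>b</p>"))
}
```

The first call should report no error, while the second should fail with a syntax error, because the strict decoder sees </p> while <br> is still open.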

On the other hand, your particular example looks quite XML-ish to me, so if you can guarantee that your data, while being HTML, will always be well-formed XML as well, you can just use the encoding/xml package. Otherwise go for go.net/html (now golang.org/x/net/html), as suggested by @elithrar, or find some other package.

Upvotes: 4
