Reputation: 14467
Here are some lines from the document:
<div class="rowleft">
<h3>Technical Fouls</h3>
<table class="num-left">
<tr class="datahl2b">
<td> </td>
<td>Players</td>
</tr>
<tr>
<td>DAL</td>
<td>
None</td>
</tr>
<tr>
<td>MIA</td>
<td>
Mike Miller</td>
<td>
Mike Miller, Jr.</td>
</tr>
</table>
</div>
I'm interested in extracting the "None", "Mike Miller", and "Mike Miller, Jr." values from this. I tried using various XML parsers, but 1) the performance is abysmal and 2) the document is apparently not a properly formatted XML document.
One thing I've been thinking about is stripping the document of newlines, splitting it at something like <tr>, seeing which lines contain data (probably using StartsWith()), and extracting it with a regex. That would be efficient enough for my program (it doesn't really matter that it takes half a second when downloading the document takes five seconds), but I'm interested in alternative solutions.
Upvotes: 2
Views: 143
Reputation: 22703
Trying to parse HTML with string manipulation and regexes is invariably going to be horribly error-prone.
If your document is not well-formed XML, I would recommend using the HTML Agility Pack.
Upvotes: 0
Reputation: 28884
HTML generally isn't properly formatted XML, so I suggest you use something like the HTML Agility Pack.
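To illustrate the general idea of using a tolerant HTML parser instead of an XML parser or regexes, here is a minimal sketch. Note that it is not the HTML Agility Pack (which is a .NET library); it swaps in Python's standard-library html.parser purely to show the approach. The names TechnicalFoulsParser, html_snippet, and players are made up for this example, and the code assumes the table of interest is the first one after the "Technical Fouls" heading, as in the question's snippet.

```python
from html.parser import HTMLParser


class TechnicalFoulsParser(HTMLParser):
    """Hypothetical helper: collects the cell text of each row in the
    first table that follows an <h3>Technical Fouls</h3> heading."""

    def __init__(self):
        super().__init__()
        self.rows = []           # one list of cell strings per <tr>
        self._in_h3 = False
        self._armed = False      # True once the target heading was seen
        self._in_table = False
        self._row = None         # cells of the row being read, or None
        self._cell = None        # text chunks of the cell being read, or None

    def _close_cell(self):
        # Flush the current cell into the current row (tolerates a missing </td>).
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True
        elif tag == "table" and self._armed:
            self._in_table = True
        elif tag == "tr" and self._in_table:
            self._row = []
        elif tag == "td" and self._row is not None:
            self._close_cell()
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False
        elif tag == "td":
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()
            self.rows.append(self._row)
            self._row = None
        elif tag == "table" and self._in_table:
            self._in_table = False
            self._armed = False  # only read the first table after the heading

    def handle_data(self, data):
        if self._in_h3 and data.strip() == "Technical Fouls":
            self._armed = True
        elif self._cell is not None:
            self._cell.append(data)


# The snippet from the question, used here as sample input.
html_snippet = """
<div class="rowleft">
<h3>Technical Fouls</h3>
<table class="num-left">
<tr class="datahl2b">
<td> </td>
<td>Players</td>
</tr>
<tr>
<td>DAL</td>
<td>
None</td>
</tr>
<tr>
<td>MIA</td>
<td>
Mike Miller</td>
<td>
Mike Miller, Jr.</td>
</tr>
</table>
</div>
"""

parser = TechnicalFoulsParser()
parser.feed(html_snippet)

# Skip the header row; map each team code to its remaining cells.
players = {row[0]: row[1:] for row in parser.rows[1:]}
print(players)
```

The point of the sketch is that the parser walks tags and text as events, so newlines inside cells and sloppy markup do not need special handling. With the HTML Agility Pack the equivalent step would typically be an XPath query against the loaded document rather than a hand-written state machine.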
Upvotes: 3