Hui

Reputation: 14467

Extracting data from an XML document without using an XML parser

Here are some lines of the document:

  <div class="rowleft">
    <h3>Technical Fouls</h3>

    <table class="num-left">
      <tr class="datahl2b">
        <td>&nbsp;</td>
        <td>Players</td>
      </tr>
      <tr>
        <td>DAL</td>
        <td>
          None</td>
      </tr>
      <tr>
        <td>MIA</td>
        <td>
          Mike Miller</td>
        <td>
          Mike Miller, Jr.</td>
      </tr>
    </table>
  </div>

I want to extract the values "None", "Mike Miller", and "Mike Miller, Jr." from this. I tried various XML parsers, but 1) the performance is abysmal and 2) the document is apparently not well-formed XML.

One thing I've been thinking about is stripping the newlines from the document, splitting it at something like <tr>, checking which pieces contain data (probably using StartsWith()), and extracting the data with a regex. That would be efficient enough for my program (it doesn't really matter that parsing takes half a second when downloading the document takes five seconds), but I'm interested in alternative solutions.
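To make the idea concrete, here is a minimal sketch of that split-and-regex approach. It is written in Python purely for illustration, not in .NET. The names `extract_players` and `SAMPLE` are made up for this example, and the sketch assumes every row keeps exactly the `<tr>`/`<td>` shape shown above.

```python
import re

# Trimmed copy of the fragment from the question (assumed representative).
SAMPLE = """
<table class="num-left">
  <tr class="datahl2b">
    <td>&nbsp;</td>
    <td>Players</td>
  </tr>
  <tr>
    <td>DAL</td>
    <td>
      None</td>
  </tr>
  <tr>
    <td>MIA</td>
    <td>
      Mike Miller</td>
    <td>
      Mike Miller, Jr.</td>
  </tr>
</table>
"""


def extract_players(html):
    """Map each team code to the list of player cells in its row."""
    # Collapse every whitespace run (including newlines) into one space.
    flat = re.sub(r"\s+", " ", html)
    result = {}
    # Split at each row start; drop the chunk that precedes the first <tr>.
    for chunk in re.split(r"<tr[^>]*>", flat)[1:]:
        cells = [c.strip() for c in re.findall(r"<td[^>]*>(.*?)</td>", chunk)]
        # Skip the header row, whose first cell holds only &nbsp;.
        if len(cells) < 2 or cells[0] == "&nbsp;":
            continue
        result[cells[0]] = cells[1:]
    return result
```

Calling `extract_players(SAMPLE)` gives one entry per team, with the first cell as the key and the remaining cells as the player list. The sketch depends on every `<td>` being closed and on no nested tables, so it will break quietly if the markup changes.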

Upvotes: 2

Views: 143

Answers (2)

Sven

Reputation: 22703

Trying to parse HTML with string manipulation and regexes is invariably going to be horribly error-prone.

If your document is not well-formed XML, I would recommend using the HTML Agility Pack, a .NET library built to parse real-world HTML.
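As a rough illustration of the parser-based route, here is a minimal sketch using Python's standard-library `html.parser`. This is not the HTML Agility Pack, which is a .NET library with its own API. It only shows the same idea of letting a lenient HTML parser find the cells. The names `CellCollector`, `technical_fouls`, and `SAMPLE` are made up for this example.

```python
from html.parser import HTMLParser

# Trimmed copy of the fragment from the question (assumed representative).
SAMPLE = """
<table class="num-left">
  <tr class="datahl2b"> <td>&nbsp;</td> <td>Players</td> </tr>
  <tr> <td>DAL</td> <td>
      None</td> </tr>
  <tr> <td>MIA</td> <td>
      Mike Miller</td> <td>
      Mike Miller, Jr.</td> </tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collects the text of every <td>, grouped by the enclosing <tr>."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._cell = None  # text pieces of the currently open <td>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # strip() also removes the non-breaking space decoded from &nbsp;.
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def technical_fouls(html):
    """Map each team code to the list of player cells in its row."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    # Skip the header row, whose first cell is empty after stripping.
    return {row[0]: row[1:] for row in parser.rows if len(row) > 1 and row[0]}
```

Unlike an XML parser, this kind of parser does not reject the document for being malformed. It reports tags and text as it encounters them.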

Upvotes: 0

Paul Creasey

Reputation: 28884

Relevant

HTML generally isn't well-formed XML. I suggest you use something like the HTML Agility Pack.

Upvotes: 3
