Parse badly formatted HTML for table data

Question

I'm writing a C# console application to retrieve table info from an external HTML web page.

I want to extract all records for data, match, opponent, result, etc. (23 rows in the example link above).

I have no control over this web page, which unfortunately isn't well formatted, so the options I've tried, such as HtmlAgilityPack and XML parsing, simply fail. I have also tried a number of regexes, but my knowledge of them is extremely poor. Here is one example I tried:

string[] trs = Regex.Matches(html,
                             @"<tr[^>]*>(?<content>.*)</tr>",
                             RegexOptions.Multiline)
                    .Cast<Match>()
                    .Select(t => t.Groups["content"].Value)
                    .ToArray();

This returns a complete list of all <tr>'s (with many records I don't need), but I'm then unable to get the data out of it.

UPDATE

Here is an example of the use of HtmlAgilityPack I tried:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
{
    foreach (HtmlNode row in table.SelectNodes("tr"))
    {
        foreach (HtmlNode cell in row.SelectNodes("td"))
        {
            Console.WriteLine(cell.InnerText);
        }
    }
}

Simon Whitehead · Accepted Answer

I think you just need to fix your HtmlAgilityPack attempt. This works fine for me:

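// Skip the first table on that page so we just get results
// (Skip/Take need "using System.Linq;")
foreach (var table in doc.DocumentNode.SelectNodes("//table").Skip(1).Take(1)) {
    // ".//td" searches inside this table only; a bare "//td" would
    // match every cell in the whole document, whichever table we are on.
    foreach (var td in table.SelectNodes(".//td")) {
        Console.WriteLine(td.InnerText);
    }
}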

This dumps a heap of data from the results table, one column per line, to the console.
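
If you want the cells grouped by row rather than as one flat list, something along these lines might help. It is only a sketch: the class name ResultsTableReader, the method ReadRows, and the sample HTML in Main are made up for illustration, and I am assuming the results are in the second table (index 1), as above. Note that SelectNodes returns null instead of an empty collection when nothing matches, which is one likely reason the nested loops in the question throw on messy markup.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class ResultsTableReader
{
    // Hypothetical helper: returns one string[] per row that has at least one <td>.
    // tableIndex is zero-based, so 1 means "the second table on the page".
    public static List<string[]> ReadRows(string html, int tableIndex)
    {
        var rows = new List<string[]>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var tables = doc.DocumentNode.SelectNodes("//table");
        if (tables == null || tableIndex < 0 || tableIndex >= tables.Count)
            return rows;

        // ".//tr" is relative to this table; "//tr" would search the whole document.
        var trs = tables[tableIndex].SelectNodes(".//tr");
        if (trs == null)
            return rows;

        foreach (var tr in trs)
        {
            var cells = tr.SelectNodes("./td");
            if (cells == null)
                continue; // e.g. a header row containing only <th> cells

            // Decode entities such as &nbsp; and trim surrounding whitespace.
            rows.Add(cells
                .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                .ToArray());
        }

        return rows;
    }

    static void Main()
    {
        // Made-up sample markup, standing in for the downloaded page.
        string html =
            "<table><tr><td>navigation</td></tr></table>" +
            "<table><tr><th>Date</th><th>Opponent</th></tr>" +
            "<tr><td>1 Jan</td><td>Example FC</td></tr></table>";

        foreach (var row in ResultsTableReader.ReadRows(html, 1))
            Console.WriteLine(string.Join(" | ", row));
    }
}
```

One caveat: ".//tr" also picks up rows from any table nested inside the one you selected, so if the real page nests tables you may need to filter the rows further (for example by the number of cells).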
