user1000698

Reputation: 77

C# analysing html code with Regex

I've got a problem with regular expressions in C#. I want to analyse the HTML code of a simple webpage. It looks like this:

<td class="ivu_table_c_dep"> 12:05 </td>
<td class="ivu_table_c_line"> Bus 398 </td>
<td>
<img src="/IstAbfahrtzeiten/img/css/link.gif" alt="" />&nbsp;
    <a class="catlink" href="http://mobil.bvg.de/Fahrinfo/bin/stboard.bin/dox?boardType=dep&input=S Mahlsdorf!&time=12:05&date=15.02.2012&&amp;" title="interner Link: Information zu dieser Haltestelle">S Mahlsdorf</a>

What I want to extract is "12:05", "Bus 398" and "S Mahlsdorf". I got the first two parts working with the following code:

Regex HTMLTag = new Regex("ivu_table_c_dep\">([^<>]*)</td>([^<>]*)<td class=\"ivu_table_c_line\">([^<>]*)</td>");

But I can't get the third part. I tried adding "([^(\">)])([^<>])", but it doesn't work.

Upvotes: 0

Views: 174

Answers (2)

Oded

Reputation: 499352

Use the HTML Agility Pack to parse and query the HTML instead of Regex. See this answer for compelling reasons why Regex is a poor solution for parsing HTML in general.

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

Html Agility Pack now supports Linq to Objects (via a LINQ to Xml Like interface). Check out the new beta to play with this feature
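For the markup in the question, a query could look roughly like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced and that the class names from the question stay stable. The class HapDemo, the helper Text and the shortened sample markup are hypothetical, only there for illustration.

```csharp
using System;
using HtmlAgilityPack;

static class HapDemo
{
    // Hypothetical helper: trimmed, entity-decoded text of the first node
    // matching the XPath expression, or null when nothing matches.
    public static string Text(HtmlDocument doc, string xpath)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    static void Main()
    {
        // Shortened stand-in for the page from the question.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<table><tr>" +
            "<td class=\"ivu_table_c_dep\"> 12:05 </td>" +
            "<td class=\"ivu_table_c_line\"> Bus 398 </td>" +
            "<td><a class=\"catlink\" href=\"#\">S Mahlsdorf</a></td>" +
            "</tr></table>");

        Console.WriteLine(Text(doc, "//td[@class='ivu_table_c_dep']"));
        Console.WriteLine(Text(doc, "//td[@class='ivu_table_c_line']"));
        Console.WriteLine(Text(doc, "//a[@class='catlink']"));
    }
}
```

Because the lookup goes through the parsed DOM, attribute order, quoting style and extra whitespace in the tags should not matter the way they do with a regex.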

Upvotes: 1

user1096188

Reputation: 1839

It's OK to use regex as a quick and dirty solution when you know the structure of the text. After all, people around here clone objects by serializing and deserializing them... You'd be better off with a small helper function, like this one:

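static string gettext(string text, string tag, string cl) {
    // Escape tag and class so any regex metacharacters in them match literally.
    string re = string.Format(@"<\s*{0}[^>]+?class\s*=\s*[""']?{1}[^>]*>([^<]*)",
                              Regex.Escape(tag), Regex.Escape(cl));
    // Group 1 is the first text node; Trim() drops padding such as " 12:05 ".
    return Regex.Match(text, re).Groups[1].Value.Trim();
}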

Although fragile, it can still be used in simple cases like yours. It extracts the text (the first text node, actually) from a given tag with a given class:

Console.WriteLine(gettext(text, "td", "ivu_table_c_dep"));  // 12:05
Console.WriteLine(gettext(text, "td", "ivu_table_c_line")); // Bus 398
Console.WriteLine(gettext(text, "a", "catlink"));           // S Mahlsdorf
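If you would rather keep a single pattern as in the question, the third value can be captured by skipping lazily ahead to the catlink anchor. This is only a minimal sketch: it assumes the three elements always appear in this order, the class Demo and the method Extract are made-up names, and the href in the sample is shortened.

```csharp
using System;
using System.Text.RegularExpressions;

static class Demo
{
    // Sample markup from the question (href and title shortened for readability).
    public const string Html =
        "<td class=\"ivu_table_c_dep\"> 12:05 </td>\n" +
        "<td class=\"ivu_table_c_line\"> Bus 398 </td>\n" +
        "<td>\n" +
        "<img src=\"/IstAbfahrtzeiten/img/css/link.gif\" alt=\"\" />&nbsp;\n" +
        "    <a class=\"catlink\" href=\"http://mobil.bvg.de/...\" title=\"interner Link\">S Mahlsdorf</a>";

    // Same idea as the pattern in the question, plus a third group:
    // ".*?" with Singleline skips lazily (across line breaks) to the anchor,
    // and "[^>]*" consumes the remaining attributes of the <a> tag.
    static readonly Regex Row = new Regex(
        @"ivu_table_c_dep"">([^<>]*)</td>\s*" +
        @"<td class=""ivu_table_c_line"">([^<>]*)</td>" +
        @".*?<a class=""catlink""[^>]*>([^<>]*)</a>",
        RegexOptions.Singleline);

    // Returns { time, line, stop } with surrounding whitespace removed, or null if nothing matches.
    public static string[] Extract(string html)
    {
        Match m = Row.Match(html);
        if (!m.Success) return null;
        return new[]
        {
            m.Groups[1].Value.Trim(),
            m.Groups[2].Value.Trim(),
            m.Groups[3].Value.Trim()
        };
    }

    static void Main()
    {
        string[] parts = Extract(Html);
        Console.WriteLine(parts == null ? "no match" : string.Join(" | ", parts));
    }
}
```

Unlike the pattern in the question, the whitespace between the two cells is matched with \s* rather than a capturing group, so the values end up in groups 1, 2 and 3. For a page with several rows, Row.Matches(html) would give one match per row under the same assumptions.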

Upvotes: 0
