C# analysing html code with Regex

Question

I've got a problem with regular expressions in C#. I want to analyse the HTML code of a simple webpage. It looks like this:

 12:05 
 Bus 398 

 
    S Mahlsdorf

What I want to extract is "12:05", "Bus 398" and "S Mahlsdorf". I got the first two parts to work with the following code:

Regex HTMLTag = new Regex("ivu_table_c_dep\">([^<>]*)([^<>]*)([^<>]*)");

But I can't get the third part. I tried adding "([^(\">)])([^<>])", but it doesn't work.

user1096188 · Accepted Answer

It's OK to use regex as a quick and dirty solution when you know the structure of the text. After all, people around here clone objects by serializing and deserializing them... You'd be better off with a small helper function, like this one:

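// Requires: using System.Text.RegularExpressions;
// Returns the first text node inside the first <tag> whose class attribute starts with cl.
static string gettext(string text, string tag, string cl) {
    string re = string.Format(@"<\s*{0}[^>]+?class\s*=\s*[""']?{1}[^>]*>([^<]*)",
                              Regex.Escape(tag), Regex.Escape(cl));
    return Regex.Match(text, re).Groups[1].Value; // empty string when nothing matches
}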

Although fragile, it can still be used in simple cases like yours. It extracts the text (the first text node, actually) from a given tag with a given class:

Console.WriteLine(gettext(text, "td", "ivu_table_c_dep"));  // 12:05
Console.WriteLine(gettext(text, "td", "ivu_table_c_line")); // Bus 398
Console.WriteLine(gettext(text, "a", "catlink"));           // S Mahlsdorf
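To try the helper end to end, here is a minimal self-contained sketch. The HTML fragment in Sample is an assumption: only the tag and class names are taken from the calls above, and the real page's markup may differ. The names Demo, GetText and Sample are hypothetical, and the Trim() call is an addition to strip the surrounding whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

static class Demo
{
    // Same pattern as the helper above. Regex.Escape guards against
    // regex metacharacters in the tag and class arguments.
    public static string GetText(string html, string tag, string cssClass)
    {
        string pattern = string.Format(
            @"<\s*{0}[^>]+?class\s*=\s*[""']?{1}[^>]*>([^<]*)",
            Regex.Escape(tag), Regex.Escape(cssClass));
        // Groups[1] holds the text up to the next '<'; Trim removes padding.
        return Regex.Match(html, pattern).Groups[1].Value.Trim();
    }

    // Hypothetical markup (an assumption, not the original page source).
    public const string Sample =
        "<tr>" +
        "<td class=\"ivu_table_c_dep\"> 12:05 </td>" +
        "<td class=\"ivu_table_c_line\"> Bus 398 </td>" +
        "<td><a class=\"catlink\" href=\"#\">\n    S Mahlsdorf</a></td>" +
        "</tr>";

    public static void Main()
    {
        Console.WriteLine(GetText(Sample, "td", "ivu_table_c_dep"));  // 12:05
        Console.WriteLine(GetText(Sample, "td", "ivu_table_c_line")); // Bus 398
        Console.WriteLine(GetText(Sample, "a", "catlink"));           // S Mahlsdorf
    }
}
```

The fragility shows up as soon as the text is wrapped in another element: for a cell like <td class="ivu_table_c_dep"><b>12:05</b></td> the capture stops at the first '<', so the result is an empty string.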
