Reputation: 77
I've got a problem with regular expressions in C#. I want to analyse the HTML code of a simple webpage. It looks like this:
<td class="ivu_table_c_dep"> 12:05 </td>
<td class="ivu_table_c_line"> Bus 398 </td>
<td>
<img src="/IstAbfahrtzeiten/img/css/link.gif" alt="" />
<a class="catlink" href="http://mobil.bvg.de/Fahrinfo/bin/stboard.bin/dox?boardType=dep&input=S Mahlsdorf!&time=12:05&date=15.02.2012&&" title="interner Link: Information zu dieser Haltestelle">S Mahlsdorf</a>
What I want to extract is "12:05", "Bus 398" and "S Mahlsdorf". I got the first two parts working with the following code:
Regex HTMLTag = new Regex("ivu_table_c_dep\">([^<>]*)</td>([^<>]*)<td class=\"ivu_table_c_line\">([^<>]*)</td>");
But I can't get the third part. I tried adding "([^(\">)])([^<>])", but it doesn't work.
Upvotes: 0
Views: 174
Reputation: 499352
Use the HTML Agility Pack to parse and query the HTML instead of Regex - see this answer for compelling reasons why Regex is a poor solution for parsing HTML in general.
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
Html Agility Pack now supports Linq to Objects (via a LINQ to Xml Like interface). Check out the new beta to play with this feature
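For the HTML in the question, a minimal sketch with the Html Agility Pack could look like the following. It assumes the HtmlAgilityPack NuGet package is referenced; the class name DepartureHapDemo, the helper TextOf and the shortened href in the sample are made up for illustration, and the XPath expressions assume the class names shown in the question stay exactly as they are.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

static class DepartureHapDemo
{
    // One row of the page, copied from the question (href shortened for readability).
    public const string Html = @"<td class=""ivu_table_c_dep""> 12:05 </td>
<td class=""ivu_table_c_line""> Bus 398 </td>
<td>
<img src=""/IstAbfahrtzeiten/img/css/link.gif"" alt="""" />
<a class=""catlink"" href=""http://mobil.bvg.de/Fahrinfo/bin/stboard.bin/dox?boardType=dep"" title=""interner Link: Information zu dieser Haltestelle"">S Mahlsdorf</a>";

    // Hypothetical helper: trimmed text of the first node matching the XPath,
    // or an empty string when nothing matches.
    public static string TextOf(HtmlDocument doc, string xpath)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? "" : node.InnerText.Trim();
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);

        // Departure time, line and destination, selected by element and class.
        Console.WriteLine(TextOf(doc, "//td[@class='ivu_table_c_dep']"));
        Console.WriteLine(TextOf(doc, "//td[@class='ivu_table_c_line']"));
        Console.WriteLine(TextOf(doc, "//a[@class='catlink']"));
    }
}
```

SelectSingleNode returns null when nothing matches, which is why the helper checks for it before reading InnerText. For a page with many rows, SelectNodes with the same XPath would return all matches instead of just the first.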
Upvotes: 1
Reputation: 1839
It's OK to use regex as a quick and dirty solution when you know the structure of the text. After all, people around here clone objects by serializing and deserializing them... You'd be better off with a small helper function, like this one:
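// requires: using System.Text.RegularExpressions;
static string gettext(string text, string tag, string cl) {
    // Match <tag ... class="cl" ...> and capture the first text node that follows it.
    string re = string.Format(@"<\s*{0}[^>]+?class\s*=\s*[""']?{1}[^>]*>([^<]*)", Regex.Escape(tag), Regex.Escape(cl));
    // Trim so that " 12:05 " comes back as "12:05".
    return Regex.Match(text, re).Groups[1].Value.Trim();
}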
Although fragile, it can still be used in simple cases like yours. It extracts the text (the first text node, actually) from a given tag with a given class:
Console.WriteLine(gettext(text, "td", "ivu_table_c_dep")); // 12:05
Console.WriteLine(gettext(text, "td", "ivu_table_c_line")); // Bus 398
Console.WriteLine(gettext(text, "a", "catlink")); // S Mahlsdorf
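If you would rather keep a single pattern, as in the question, the original expression can be extended with a third group for the link text. The following is only a sketch under the assumption that the three elements always appear in this order within a row; the class name DepartureRegexDemo, the method Extract and the shortened href in the sample are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class DepartureRegexDemo
{
    // One row of the page, copied from the question (href shortened for readability).
    public const string Html = @"<td class=""ivu_table_c_dep""> 12:05 </td>
<td class=""ivu_table_c_line""> Bus 398 </td>
<td>
<img src=""/IstAbfahrtzeiten/img/css/link.gif"" alt="""" />
<a class=""catlink"" href=""http://mobil.bvg.de/Fahrinfo/bin/stboard.bin/dox?boardType=dep"" title=""interner Link: Information zu dieser Haltestelle"">S Mahlsdorf</a>";

    // Returns { time, line, destination } for the first row found,
    // or an empty array when the pattern does not match.
    public static string[] Extract(string html)
    {
        var row = new Regex(
            @"ivu_table_c_dep""\s*>\s*([^<]*?)\s*</td>\s*" +            // group 1: departure time
            @"<td class=""ivu_table_c_line""\s*>\s*([^<]*?)\s*</td>" +  // group 2: line
            @".*?<a class=""catlink""[^>]*>\s*([^<]*?)\s*</a>",         // group 3: link text
            RegexOptions.Singleline); // let .*? skip over the line breaks before the link

        Match m = row.Match(html);
        if (!m.Success) return new string[0];
        return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
    }

    static void Main()
    {
        // Prints: 12:05 | Bus 398 | S Mahlsdorf
        Console.WriteLine(string.Join(" | ", Extract(Html)));
    }
}
```

The attempt in the question fails because ([^(\">)]) is a character class that matches exactly one character, so it cannot skip over the <td>, <img> and the attributes of the <a> tag. Here .*? together with RegexOptions.Singleline does the skipping, and [^>]* consumes the remaining attributes of the link.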
Upvotes: 0