David Dury

Reputation: 5717

Linq parse html string

I want to parse an HTML page and get a specific value from it. How can I do this using LINQ or string parsing in C#?

------------- MORE HTML ----------

    <span class="date">
        04.09.2012
    </span>
    <table cellspacing="0"><tr><th scope="row">1 EUR</th><td><span>**4,4907**</span></td><td><span class="rise">+0,0009</span></td><td><span class="rise">+0,02%</span></td></tr><tr><th scope="row">1 USD</th><td><span>3,5635</span></td><td><span class="fall">-0,0093</span></td><td><span class="fall">-0,26%</span></td></tr></table>

------------- MORE HTML ----------

I am interested in getting the value 4,4907 (marked in bold above).

Any idea how to achieve this?

Thanks!

Upvotes: 1

Views: 2613

Answers (3)

mortb

Reputation: 9849

Be careful when trying to parse HTML.

The obvious approach would be to load the page into an XDocument (as XML), but HTML is often ambiguous or contains syntax errors, so this is likely to fail on real pages.
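That said, the table fragment in the question happens to be well-formed, so LINQ to XML can read it once that fragment has been cut out of the page. A minimal sketch under that assumption; `RateTable` and `FindRate` are hypothetical names chosen for illustration, and `XElement.Parse` will throw on markup that is not well-formed XML:

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class RateTable
{
    // Parses a well-formed <table> fragment and returns the text of the first
    // <span> in the row whose <th> text equals currencyLabel, or null if no
    // such row exists.
    public static string FindRate(string tableXml, string currencyLabel)
    {
        // Assumption: tableXml is well-formed; otherwise this throws XmlException.
        XElement table = XElement.Parse(tableXml);

        return table.Descendants("tr")
            .Where(row => ((string)row.Element("th") ?? "").Trim() == currencyLabel)
            .Select(row => (string)row.Elements("td").Elements("span").FirstOrDefault())
            .FirstOrDefault();
    }
}
```

Calling `RateTable.FindRate(tableXml, "1 EUR")` on the table from the question (without the bold markers) should give the text of the first cell after the "1 EUR" header. Looking the row up by its header avoids depending on the row order.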

People here on Stack Overflow instead suggest the HTML Agility Pack (http://htmlagilitypack.codeplex.com/), which is said to do a great job of parsing HTML. You can then use XPath to query the document for the content you need.

Upvotes: 1

arutaku

Reputation: 6087

You can try a regular expression in C# this way:

http://www.c-sharpcorner.com/UploadFile/prasad_1/RegExpPSD12062005021717AM/RegExpPSD.aspx

Use it to find the string between "<span>" and "</span>".

Or you can use an HTML parser like "jericho" and navigate through HTML tags to reach your value.

Upvotes: 0

AKX

Reputation: 168913

If you only need that bit, use a regular expression. (But don't use a regular expression to parse more complex HTML.)

<td><span>4,4907</span></td>

would be matched most conveniently by the regular expression

<td><span>([0-9,]+)</span></td> 

And see for example this quickly Googled page on how to use regexps with C#.
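As a rough sketch of how that pattern could be applied with `System.Text.RegularExpressions`: the class and method names below (`RateExtractor`, `FirstRate`) are hypothetical, and the code assumes the asterisks around 4,4907 in the question are only emphasis and do not appear in the actual page.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class RateExtractor
{
    // Pattern from the answer above: digits and commas inside <td><span>...</span></td>.
    // Cells such as <span class="rise"> are skipped because of the attribute.
    private static readonly Regex RatePattern =
        new Regex(@"<td><span>([0-9,]+)</span></td>");

    // Returns the first matching value as text, or null when nothing matches.
    public static string FirstRate(string html)
    {
        Match match = RatePattern.Match(html);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

`RateExtractor.FirstRate(html)` returns the first plain `<td><span>` cell in the page, which is the EUR value only as long as that row comes first. The result is still a string with a comma as the decimal separator, so converting it to a number needs a culture or number format that treats the comma that way.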

Upvotes: 4
