user1570048

Reputation: 880

Parsing HTML with a regular expression: how to extract only the text inside <b>

Here is my regular expression:

Dim TableHeaderExpression As String = "<th[^>]*>(.*?)</th>"

And here is my HTML:

<th class="seller-col">
    <b>Relevanz</b>
    <span class="ps-sprite ps-sprite-sortdw" title=""></span>
</th>

This expression gives me everything inside the <th> tag, so it outputs:

<b>Relevanz</b>
<span class="ps-sprite ps-sprite-sortdw" title=""></span>

How can I make it output only the following?

Relevanz

In other words, ignore everything inside <th> except the text inside <b>.

Upvotes: 0

Views: 47

Answers (1)

Oded

Reputation: 499212

Rather than parsing HTML with a regular expression (not the best option), use the HTML Agility Pack to parse and query the HTML. From the project description:

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
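As a rough illustration of that approach, here is a minimal VB.NET sketch that pulls the text of the <b> element out of each <th>. It assumes the HtmlAgilityPack NuGet package is referenced. The surrounding <table><tr> wrapper, the `html` variable, and the module name are made up for the example and are not part of the question's markup.

```vbnet
Imports System
Imports HtmlAgilityPack

Module TableHeaderExample
    Sub Main()
        ' Sample markup based on the question; the <table><tr> wrapper is an
        ' assumption added so the fragment resembles a complete table.
        Dim html As String =
            "<table><tr><th class=""seller-col"">" & vbLf &
            " <b>Relevanz</b>" & vbLf &
            " <span class=""ps-sprite ps-sprite-sortdw"" title=""""></span>" & vbLf &
            "</th></tr></table>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' XPath: every <b> that is a direct child of a <th>.
        ' SelectNodes returns Nothing when no node matches, so check first.
        Dim boldNodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//th/b")
        If boldNodes IsNot Nothing Then
            For Each boldNode As HtmlNode In boldNodes
                ' InnerText is the text content of the node, without tags.
                Console.WriteLine(boldNode.InnerText.Trim())
            Next
        End If
    End Sub
End Module
```

The XPath `//th/b` selects only the <b> children, so the sibling <span> is ignored without any pattern matching on the raw markup. If the real page nests the <b> deeper inside the header cell, `//th//b` may be the better query.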

Upvotes: 1
