genespos

InnerText=InnerHtml - How to extract readable text with HtmlAgilityPack

I need to extract text from some very messy HTML.

I'm trying to do this with VB.NET and HtmlAgilityPack.

For the tag I need to parse, InnerText and InnerHtml are identical; both contain:

Name:<!--b>&#61;</b--> Albert E<!--span-->instein  s<!--i>&#89;</i-->ection: 3 room: -

While debugging I can read it with the "HTML viewer" visualizer, which shows:

Name: Albert Einstein section: 3 room: -

How can I get this into a string variable?

EDIT:

I use this code to get the node:

Dim ElePs As HtmlNodeCollection = _
    mWPage.DocumentNode.SelectNodes("//div[@id='div_main']//p")
For Each EleP As HtmlNode In ElePs
    'Here I need to get EleP.InnerText "normalized"
Next

Upvotes: 1

Views: 2200

Answers (1)

Xi Sigma

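If you look closely, the mess is just HTML comments, which should be ignored. The XPath text() test matches text nodes only, not comment nodes, and the [normalize-space()] predicate skips whitespace-only nodes, so selecting the text nodes and joining them with string.Join is enough: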

C#

var text = string.Join("", htmlDoc.DocumentNode
    .SelectNodes("//text()[normalize-space()]")
    .Select(t => t.InnerText));

VB.net

Dim text = String.Join("", From t In htmlDoc.DocumentNode.SelectNodes("//text()[normalize-space()]")
                           Select t.InnerText)

The HTML is valid; there is nothing wrong with it. It was just written by someone without a soul.

Based on your update, this should do:

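Dim ElePs As HtmlNodeCollection = mWPage.DocumentNode.SelectNodes("//div[@id='div_main']//p")
For Each EleP As HtmlNode In ElePs
    ' Select only the text nodes under this <p>; text() does not match comment nodes
    Dim textNodes = EleP.SelectNodes(".//text()[normalize-space()]")
    ' SelectNodes returns Nothing when nothing matches, so skip empty paragraphs
    If textNodes Is Nothing Then Continue For
    Dim text = String.Join("", From t In textNodes Select t.InnerText).Trim()
Next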

Note the leading .// in the XPath: it searches only the descendants of the current node, whereas // always starts from the document root, no matter which node you call it on.
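To see the difference between .// and // in practice, here is a minimal console sketch. The sample markup (including the second paragraph) and the module name Demo are made up for illustration; only the HtmlAgilityPack calls already used above are assumed.

```vbnet
Imports System
Imports System.Linq
Imports HtmlAgilityPack

Module Demo
    Sub Main()
        ' Hypothetical markup modelled on the question; the second <p> is invented.
        Dim html As String =
            "<div id='div_main'>" &
            "<p>Name:<!--b>&#61;</b--> Albert E<!--span-->instein</p>" &
            "<p>room: -</p>" &
            "</div>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim paragraphs = doc.DocumentNode.SelectNodes("//div[@id='div_main']//p")
        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        If paragraphs Is Nothing Then Return

        For Each p As HtmlNode In paragraphs
            ' Relative path: only the text nodes below this <p>.
            Dim own = p.SelectNodes(".//text()[normalize-space()]")
            ' Absolute path: every non-blank text node in the whole document,
            ' even though the call is made on p.
            Dim everyText = p.SelectNodes("//text()[normalize-space()]")

            Dim ownCount = If(own Is Nothing, 0, own.Count)
            Dim totalCount = If(everyText Is Nothing, 0, everyText.Count)
            Dim ownText = If(own Is Nothing, "",
                             String.Join("", own.Select(Function(t) t.InnerText)).Trim())

            Console.WriteLine("{0} of {1} text nodes -> {2}", ownCount, totalCount, ownText)
        Next
    End Sub
End Module
```

For each paragraph, the first count should cover only that paragraph's own text nodes, while the second count stays the same on every iteration because // ignores the context node.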

Upvotes: 2
