Nika Javakhishvili

Reputation: 462

HtmlAgilityPack InnerText bug

I am trying to parse a website and extract some content. This is my code:

doc.DocumentNode.SelectSingleNode("//div[@class='article-content']").InnerText

I just need the text, but the result looks like this: some text... followed by this array:

( [0] => 39 [1] => 6 [2] => 10 [3] => 9 [4] => 13 [5] => 5 [6] => 7 [7] => 12 [8] => 11 [9] => 8 [10] => 14 [11] => 82 ) [archtoday] => 0 [hour] => 09:00 [autoarchive] => 1 [autoarchivereset] => 1 [show_description] => 0 [num_desc_words] => 10 [show_description_image] => 0 [num_leading_articles] => 0

I've tried:

HtmlEntity.DeEntitize(doc.DocumentNode.SelectSingleNode("//div[@class='article-content']").InnerText)

But the result is the same.

Link: http://www.interpressnews.ge/ge/politika/353565-barak-obamas-thanashemtse-rusethma-saqarthveloshi-gankhorcielebuli-intervenciis-dros-mighebuli-gakvethilebi-aithvisa.html

The div:

<div class="article-content">

Upvotes: 0

Views: 289

Answers (1)

Chris

Reputation: 27609

The thing to note about InnerText is that it gives you the text content of the node but doesn't care about CSS or anything else that affects how the web page itself appears. This means that if a node has its display CSS property set to none, the HTML parser doesn't care: it will return the text of that node anyway. This is exactly what is happening here.

http://www.interpressnews.ge/ge/politika/353565-barak-obamas-thanashemtse-rusethma-saqarthveloshi-gankhorcielebuli-intervenciis-dros-mighebuli-gakvethilebi-aithvisa.html is the page you mentioned in comments. If you view the source of the page (Ctrl+U in Chrome and, I think, Firefox; I'm not sure of the shortcut in IE) and search for article-content, you will find the article and see that it also contains a <div style="display:none;"> which holds the strange text you are seeing. This is therefore not a bug in the HTML Agility Pack.

You will need to analyze the page and write more complex code that selects only the nodes you want, or that removes the hidden nodes before you read InnerText.
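As a rough sketch of one way to do that, the code below removes script, style, and inline-hidden nodes from the article div before reading InnerText. The class VisibleText and its Extract method are hypothetical names made up for this example, not part of HtmlAgilityPack. It also assumes the hidden content is always marked with an inline lowercase style attribute such as display:none; text hidden through a CSS class or an external stylesheet would not be caught.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class VisibleText
{
    // Hypothetical helper: returns the text of the first node matching xpath,
    // leaving out scripts, styles, and nodes hidden with an inline style.
    public static string Extract(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var root = doc.DocumentNode.SelectSingleNode(xpath);
        if (root == null)
        {
            return string.Empty;
        }

        // translate() strips spaces so "display: none" and "display:none" both match.
        // Assumption: the style value is lowercase, as it is on the page in question.
        var unwanted = root.SelectNodes(
            ".//script | .//style | " +
            ".//*[contains(translate(@style, ' ', ''), 'display:none')]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (unwanted != null)
        {
            // Copy to a list so removal does not interfere with enumeration.
            foreach (var node in unwanted.ToList())
            {
                node.Remove();
            }
        }

        return HtmlEntity.DeEntitize(root.InnerText).Trim();
    }
}
```

With the question's selector, the call would be VisibleText.Extract(html, "//div[@class='article-content']"), where html is the downloaded page source. If the article text lives in predictable child elements, selecting those directly may be simpler than filtering out hidden nodes.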

Upvotes: 2
