fizmhd

Reputation: 536

How to extract text inside a div tag using htmlagilitypack

I want to extract the text "Some Text Goes here..." from inside the outer div below. I am using Html Agility Pack with C#.

<div class="productDescriptionWrapper">
Some Text Goes here...
<div class="emptyClear"> </div>
</div>

This is what I have:

Description = doc.DocumentNode.SelectNodes("//div[@class=\"productDescriptionWrapper\").Descendants("div").Select(x => x.InnerText).ToList();

I get this error:

An unhandled exception of type 'System.NullReferenceException' 

I know how to extract the text when it sits inside an <h1> or <p>: I pass "h1" or "p" to Descendants instead of "div".

Somebody please assist.

Upvotes: 0

Views: 3730

Answers (2)

har07

Reputation: 89335

You cannot get a NullReferenceException if doc is loaded from the HTML snippet you posted. Anyway, if you mean to get the text within the outer <div> but not from the inner one, use the XPath /text(), which selects direct child text nodes only.

For example, given this HTML snippet:

var html = @"<div class=""productDescriptionWrapper"">
Some Text Goes here...
<div class=""emptyClear"">Don't get this one</div>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);

..this expression returns the text from the outer <div> only:

var Description = doc.DocumentNode
                     .SelectNodes("//div[@class='productDescriptionWrapper']/text()")
                     .Select(x => x.InnerText.Trim())
                     .First();
//Description : 
//"Some Text Goes here..."

..while in contrast, the following returns all the text:

var Description = doc.DocumentNode
                     .SelectNodes("//div[@class='productDescriptionWrapper']")
                     .Select(x => x.InnerText.Trim())
                     .First();
//Description :
//"Some Text Goes here...
//Don't get this one"
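
Since SelectNodes returns null rather than an empty collection when the XPath matches nothing, a null check before the LINQ calls avoids a NullReferenceException on pages where the wrapper div is missing. The following is a minimal sketch of that idea, not a definitive implementation; the class name DescriptionExtractor and the method GetDescription are hypothetical names chosen for illustration, and returning an empty string for "not found" is an assumption.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class DescriptionExtractor
{
    // Hypothetical helper: returns the first non-blank direct text node of the
    // wrapper div, or an empty string when nothing matches (assumed convention).
    public static string GetDescription(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes yields null, not an empty collection, when there is no match.
        var textNodes = doc.DocumentNode
            .SelectNodes("//div[@class='productDescriptionWrapper']/text()");
        if (textNodes == null)
            return string.Empty;

        // Skip whitespace-only text nodes and take the first real one.
        return textNodes
            .Select(n => n.InnerText.Trim())
            .FirstOrDefault(t => t.Length > 0) ?? string.Empty;
    }
}
```

Calling DescriptionExtractor.GetDescription(html) with the snippet above should give the outer text only, while HTML without the wrapper div gives an empty string instead of throwing.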

Upvotes: 1

Xi Sigma

Reputation: 2372

Use single quotes inside the XPath, such as:

//div[@class='productDescriptionWrapper']

To get all descendants of all types, use:

//div[@class='productDescriptionWrapper']//*

To get all descendants of a specific type, such as p, use //div[@class='productDescriptionWrapper']//p.

To get all descendants that are either a div or a p:

//div[@class='productDescriptionWrapper']//*[self::div or self::p]

Say you wanted all non-blank descendant text nodes; then use:

//div[@class='productDescriptionWrapper']//text()[normalize-space()]
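
To show how one of these expressions plugs into Html Agility Pack, here is a rough sketch that runs the last XPath and collects the trimmed text of every non-blank descendant text node. The class name DescendantText and the method GetAllText are hypothetical, and returning an empty list when nothing matches is an assumption made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class DescendantText
{
    // Hypothetical helper: trimmed text of all non-blank text nodes
    // anywhere under the wrapper div.
    public static List<string> GetAllText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The [normalize-space()] predicate drops whitespace-only text nodes.
        var nodes = doc.DocumentNode.SelectNodes(
            "//div[@class='productDescriptionWrapper']//text()[normalize-space()]");

        // SelectNodes returns null when nothing matches, so guard before LINQ.
        if (nodes == null)
            return new List<string>();

        return nodes.Select(n => n.InnerText.Trim()).ToList();
    }
}
```

Swapping in any of the other expressions works the same way, except that the element-selecting ones return element nodes, so InnerText then includes the text of nested children as well.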

Upvotes: 1
