Majlik

Reputation: 1082

Ignoring &#160; when parsing with HtmlAgilityPack

I'm parsing an HTML table in C# using Html Agility Pack, and the table contains non-breaking spaces.

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(page);

Here page is a string containing a table with the special character &#160; within its text:

<td>&#160;test</td>
<td>number =&#160;123&#160;</td>

Using SelectSingleNode(".//td").InnerText returns text that contains these special characters, but I want to ignore them.

Is there an elegant way to ignore them (with or without the help of Html Agility Pack) without modifying the source table?

Upvotes: 1

Views: 2197

Answers (2)

Ben

Reputation: 35643

The "special character" you mention, the non-breaking space, is a valid character that can legitimately appear in text, just as "fancy quotes", em dashes and so on can.

Often we want to treat certain characters as being equivalent.

  • So you might want to treat an em-dash, en-dash and minus sign/dash as being the same.
  • Or fancy quotes as the same as straight quotes.
  • Or the non-breaking-space as an ordinary space.

However, this is not something Html Agility Pack can help with. You need to use something like string.Replace or your own canonicalization function to do this.

I would suggest something like:

static string CleanupStringForMyApp(string s)
{
    // Replace characters with their equivalents.
    // '\u00A0' is the non-breaking space (character code 160).
    s = s.Replace('\u00A0', ' ');
    // Add any more replacements you want to do here.
    return s;
}
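
If more than one family of characters needs to be treated as equivalent (dashes, quotes, spaces), a lookup table may scale better than a chain of Replace calls. The following is only a sketch: the class name TextCanonicalizer, the method Canonicalize and the particular mapping are hypothetical choices for illustration, not part of Html Agility Pack or of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class TextCanonicalizer
{
    // Hypothetical mapping: extend or trim it to suit your application.
    private static readonly Dictionary<char, char> Equivalents = new Dictionary<char, char>
    {
        { '\u00A0', ' ' },   // non-breaking space      -> ordinary space
        { '\u2013', '-' },   // en dash                 -> hyphen-minus
        { '\u2014', '-' },   // em dash                 -> hyphen-minus
        { '\u2018', '\'' },  // left single fancy quote  -> straight quote
        { '\u2019', '\'' },  // right single fancy quote -> straight quote
        { '\u201C', '"' },   // left double fancy quote  -> straight quote
        { '\u201D', '"' },   // right double fancy quote -> straight quote
    };

    // Returns a copy of s in which every mapped character is replaced by its
    // equivalent; all other characters are copied unchanged.
    public static string Canonicalize(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            char mapped;
            sb.Append(Equivalents.TryGetValue(c, out mapped) ? mapped : c);
        }
        return sb.ToString();
    }
}
```

Note that this only handles text that already contains the actual character U+00A0. If the string still holds the literal entity &#160;, it has to be decoded first.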

Upvotes: 0

DGibbs

Reputation: 14618

You could use HttpUtility.HtmlDecode (in System.Web):

string foo = HttpUtility.HtmlDecode("Special char: &#160;");

This gives you a string in which the entity has been decoded to an actual non-breaking space character (U+00A0):

Special char:
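
Decoding alone turns &#160; into the character U+00A0 rather than removing it, so to actually ignore it you would probably decode first and then replace or trim. Below is a minimal sketch of that combination, assuming InnerText hands back the entity undecoded, as the question suggests. The class name CellText and the method Clean are hypothetical; WebUtility.HtmlDecode from System.Net is used here in place of HttpUtility.HtmlDecode only because it does not require a reference to System.Web.

```csharp
using System;
using System.Net;

static class CellText
{
    // Decodes HTML entities, turns non-breaking spaces into ordinary
    // spaces, and trims leading and trailing whitespace.
    public static string Clean(string innerText)
    {
        string decoded = WebUtility.HtmlDecode(innerText);
        return decoded.Replace('\u00A0', ' ').Trim();
    }
}
```

With the cells from the question, CellText.Clean("&#160;test") yields "test" and CellText.Clean("number =&#160;123&#160;") yields "number = 123". Drop the Trim() call if leading or trailing spaces matter to you.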

Upvotes: 2
