MC9000

Reputation: 2403

Bug in HTMLAgilityPack when getting href attribute value. C#

I found what looks like a nasty bug in HTMLAgilityPack: some attribute values are not returned in full but are truncated. Specifically, when I try to get the href value of an anchor tag, only the root domain is returned, and everything after it (the path and query string) is ignored. Does anyone know a good workaround?

Example:

node.SelectSingleNode("//link").Attributes["href"].Value

returns https://www.example.com instead of returning https://www.example.com/mypage.php?_src=ffk_title&ffkid=66534&site=data:http%3A%2F%2Fwww.othersite.com%2Frss%2F

The link looks like this:

<a class="tlink" href="https://www.example.com/mypage.php?_src=ffk_title&amp;ffkid=66534&amp;site=data:http%3A%2F%2Fwww.othersite.com%2Frss%2F" target="_blank">Click to get feed</a>

Anyway, for now I'll just grab the link tag and parse it the old way. I figure HTMLAgilityPack gets confused by atypical characters in the href attribute. I hope it's just something I'm doing wrong, but this kind of quirk really hurts.

Upvotes: 1

Views: 346

Answers (1)

Alan Lacerda

Reputation: 868

For anchor tags, you should use the //a XPath expression; //link matches <link> elements, not anchors:

node.SelectSingleNode("//a").Attributes["href"].Value;

Additionally, if you need to reference an anchor with a particular class, you could use:

node.SelectSingleNode("//a[@class='tlink']").Attributes["href"].Value;

A working example can be seen here.
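For completeness, here is a minimal sketch of the approach above as a small console program, assuming the HtmlAgilityPack NuGet package is referenced. The class name `HrefExample` and the helper `GetHref` are made up for illustration and are not part of the library. Depending on the library version and settings, `Value` may return the attribute text with entities such as `&amp;` still encoded, so the sketch passes it through `HtmlEntity.DeEntitize` to be safe.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class HrefExample
{
    // Hypothetical helper: returns the entity-decoded href of the first node
    // matching the XPath expression, or null if no node or no href is found.
    public static string GetHref(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches, so check first.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null)
        {
            return null;
        }

        // The attribute indexer returns null when the attribute is missing.
        HtmlAttribute href = node.Attributes["href"];
        if (href == null)
        {
            return null;
        }

        // Turn "&amp;" back into "&" in case the raw attribute text is returned.
        return HtmlEntity.DeEntitize(href.Value);
    }

    public static void Main()
    {
        // The anchor from the question, as a C# string literal.
        const string html =
            "<a class=\"tlink\" href=\"https://www.example.com/mypage.php?_src=ffk_title&amp;ffkid=66534&amp;site=data:http%3A%2F%2Fwww.othersite.com%2Frss%2F\" target=\"_blank\">Click to get feed</a>";

        // Select the anchor by class rather than using //link.
        Console.WriteLine(GetHref(html, "//a[@class='tlink']"));
    }
}
```

The null checks matter here: if the XPath does not match anything, chaining `.Attributes["href"].Value` directly onto `SelectSingleNode` throws a `NullReferenceException` instead of telling you the selector was wrong.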

Upvotes: 3
