Reputation: 2403
Found a nasty bug in HTMLAgilityPack whereby some attribute values are NOT returned in full; they are truncated. Specifically, when I try to read the href value of an anchor tag, only the root domain is returned, and everything after it (the path and query string) is dropped. Does anyone know a good workaround?
Example:
node.SelectSingleNode("//link").Attributes["href"].Value
returns https://www.example.com instead of returning https://www.example.com/mypage.php?_src=ffk_title&ffkid=66534&site=data:http%3A%2F%2Fwww.othersite.com%2Frss%2F
The link looks like this:
<a class="tlink" href="https://www.example.com/mypage.php?_src=ffk_title&ffkid=66534&site=data:http%3A%2F%2Fwww.othersite.com%2Frss%2F" target="_blank">Click to get feed</a>
Anyway, for now I'll just grab the anchor tag and parse it the old way. I figure HTMLAgilityPack gets confused when there are atypical characters in the href attribute. I hope it's just something I'm doing wrong, because this kind of quirk really hurts.
Upvotes: 1
Views: 346
Reputation: 868
Your XPath, //link, matches <link> elements rather than anchors. For anchor tags, you should use the //a XPath expression:
node.SelectSingleNode("//a").Attributes["href"].Value;
Additionally, if you need to reference an anchor with a particular class, you could use:
node.SelectSingleNode("//a[@class='tlink']").Attributes["href"].Value;
A working example can be seen here.
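For reference, here is a minimal self-contained sketch of the class-based lookup, using the anchor from the question as input. The class name HrefExample and the helper GetTlinkHref are hypothetical names made up for illustration; the sketch assumes the HtmlAgilityPack NuGet package is referenced. It also adds a null check, since SelectSingleNode returns null when nothing matches.

```csharp
using System;
using HtmlAgilityPack;

public static class HrefExample
{
    // Hypothetical helper (not part of HtmlAgilityPack): returns the href of the
    // first <a class="tlink"> element, or null when no such anchor exists.
    public static string GetTlinkHref(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select the anchor by tag name and class instead of //link.
        HtmlNode anchor = doc.DocumentNode.SelectSingleNode("//a[@class='tlink']");

        // SelectSingleNode returns null when nothing matches, so guard before use.
        if (anchor == null)
        {
            return null;
        }

        // GetAttributeValue falls back to the supplied default if href is missing.
        return anchor.GetAttributeValue("href", string.Empty);
    }

    public static void Main()
    {
        const string html =
            "<a class=\"tlink\" href=\"https://www.example.com/mypage.php?_src=ffk_title&ffkid=66534&site=data:http%3A%2F%2Fwww.othersite.com%2Frss%2F\" target=\"_blank\">Click to get feed</a>";

        Console.WriteLine(GetTlinkHref(html) ?? "(no matching anchor)");
    }
}
```

If the page contains several anchors, narrowing the XPath by class (as here) or by another attribute avoids picking up the first unrelated link in the document.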
Upvotes: 3