Reputation: 765
I'm using HtmlAgilityPack to extract several HTML tags. Here's what I do:
HtmlDoc = new HtmlDocument();
StringReader sr = new StringReader(decodedHTML);
HtmlDoc.Load(sr);
sr.Close();
var anchor_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_ANCHOR + "[@" + HTML.ATTRIBUT_HREF + "]");
var embed_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_EMBED + "[@" + HTML.TAG_EMBED_SRC + "]");
var iframe_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_IFRAME + "[@" + HTML.TAG_IFRAME_SRC + "]");
var img_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_IMG + "[@" + HTML.TAG_IMG_SRC + "]");
var audio_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_AUDIO); // may contain inner-html
var object_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_OBJECT); // may contain inner-html
var video_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_VIDEO); // may contain inner-html
Here decodedHTML is the HTML page as a string. After that, I check whether the variables above are null:
if (anchor_tags != null)
{
ExtractLinks_AnchorTags(anchor_tags);
}
if(audio_tags != null)
{
ExtractLinks_AudioTags(audio_tags);
}
if(embed_tags!=null)
{
ExtractLinks_EmbedTags(embed_tags);
}
if (iframe_tags != null)
{
ExtractLinks_iFrameTags(iframe_tags);
}
if (img_tags != null)
{
ExtractLinks_ImgTags(img_tags);
}
if (object_tags != null)
{
ExtractLinks_ObjectTags(object_tags);
}
if (video_tags != null)
{
ExtractLinks_ObjectTags(video_tags);
}
Some of them are definitely null, because most of the ExtractLinks methods aren't even called. For example, when I visit youtube.com, the page has several iframe tags and the code doesn't recognize them.
Edit:
When I delete the
"[@" + HTML.TAG_IFRAME_SRC + "]"
part, the iframes are recognized, but I only want to extract the iframes that have a src attribute. What's the correct XPath syntax for that?
Upvotes: 1
Views: 2536
Reputation: 40546
HtmlAgilityPack does not load the contents of iframe elements.
In order to inspect the content of an iframe, read the src attribute (which represents the iframe's URI) and perform a separate web request to load that into a separate HtmlDocument.
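As a minimal sketch of that separate request, the snippet below uses HtmlAgilityPack's HtmlWeb class to download the framed page into its own HtmlDocument. The IframeLoader class and LoadFrame method are hypothetical names introduced for illustration, and the sketch assumes the iframe URI is already absolute.

```csharp
using System;
using HtmlAgilityPack;

public static class IframeLoader
{
    // Hypothetical helper (not from the original post): downloads the
    // document that an iframe points to. Assumes frameUri is absolute.
    public static HtmlDocument LoadFrame(Uri frameUri)
    {
        // HtmlWeb is HtmlAgilityPack's built-in downloader; Load issues
        // a separate HTTP request and parses the response.
        var web = new HtmlWeb();
        return web.Load(frameUri.AbsoluteUri);
    }
}
```

The returned document can then be queried with SelectNodes in the same way as the outer page.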
Along the way, be aware of these possible issues:
- The src attribute may contain a relative URI. For example, if you visit http://www.example.com and see that an iframe has src="/samplePage", you should first convert that to an absolute URI (in this case, http://www.example.com/samplePage).
- Some iframe elements may not have a src attribute, because it is added dynamically via JavaScript when the document is rendered in a browser. It is also possible to create entire iframe elements with JavaScript, elements that you wouldn't even see if you just do a regular HttpWebRequest. In cases like these, you have to analyze the JavaScript present on the page and duplicate that logic in your program.
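For the relative-URI case, one possible approach is the Uri(Uri, string) constructor from the .NET base class library, which resolves a reference against a base URI. The IframeUriResolver class and ResolveSrc method below are hypothetical names used only for this sketch.

```csharp
using System;

public static class IframeUriResolver
{
    // Hypothetical helper (not from the original post): resolves an
    // iframe's src value against the URI of the page it was found on.
    public static Uri ResolveSrc(Uri pageUri, string src)
    {
        // Uri(Uri, string) combines the base URI with a relative
        // reference; an src that is already absolute is kept as-is.
        return new Uri(pageUri, src);
    }
}
```

Called with new Uri("http://www.example.com") and "/samplePage", this yields http://www.example.com/samplePage, matching the example above.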
The XPath expression for iframe elements that have a src attribute is: //iframe[@src]
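A short sketch of that expression in use with HtmlAgilityPack follows. The IframeSources class and Extract method are hypothetical names; the null check is there because SelectNodes returns null when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class IframeSources
{
    // Hypothetical helper (not from the original post): returns the src
    // value of every iframe element that has a src attribute.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//iframe[@src]");
        if (nodes == null)
        {
            return result; // no iframe with a src attribute was found
        }

        foreach (HtmlNode node in nodes)
        {
            result.Add(node.GetAttributeValue("src", string.Empty));
        }
        return result;
    }
}
```

If this returns an empty list for a page that visibly contains iframes in a browser, the src attributes are probably being added by script after the page loads.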
Upvotes: 1