Reputation: 66
I am trying to get the HTML of a site with HtmlAgilityPack (html-agility-pack):
private static void GetHtml()
{
    var html = ".....";
    HtmlWeb web = new HtmlWeb();
    var htmlDoc = web.Load(html);
    var node = htmlDoc.DocumentNode.SelectSingleNode("//body");
    string h = node.OuterHtml;
    Console.WriteLine(h);
}
But where the data should appear, the output only contains 'Loading....'.
How can I solve this problem?
Upvotes: 2
Views: 532
Reputation: 6683
You are getting a "Loading" message because that is what the page's original HTML source contains. After the document is loaded in your browser, scripts running on the page generate the new content, but HtmlAgilityPack can't see that: it was created as a library for parsing HTML, not for running scripts.
Update: Recent versions of HtmlAgilityPack can run a WebBrowser (System.Windows.Forms) in the background and execute the JavaScript on the page when you call the LoadFromBrowser() method. The dynamically generated HTML can then be scraped from the resulting page. See http://html-agility-pack.net/from-browser.
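As a rough sketch of how LoadFromBrowser() might be used for the case in the question: the URL below is a placeholder, and the "wait until the HTML no longer contains Loading" check is only an assumption about how to detect that the scripts have finished on that particular page. It also assumes a .NET Framework build of HtmlAgilityPack, since the method relies on the Windows Forms WebBrowser control.

```csharp
using System;
using HtmlAgilityPack;

internal static class Program
{
    // The WinForms WebBrowser control used behind the scenes expects an STA thread.
    [STAThread]
    private static void Main()
    {
        // Placeholder URL (assumption): replace with the page you want to scrape.
        const string url = "https://example.com/";

        var web = new HtmlWeb();

        // The callback is polled with the current page HTML; returning true
        // tells HtmlAgilityPack that the page scripts are done.
        // The "Loading" marker is an assumption taken from the question.
        HtmlDocument doc = web.LoadFromBrowser(
            url,
            (string html) => !html.Contains("Loading"));

        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        Console.WriteLine(body != null ? body.OuterHtml : "(no <body> element found)");
    }
}
```

If the page never removes the marker text, the check above never succeeds, so a more specific condition (for example, the presence of an element that only the script creates) is probably safer.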
Upvotes: 2
Reputation: 66
Thank you for the answer. You are right: the problem is that the JavaScript is not run.
I have already solved this problem using GeckoFX:
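geckoWebBrowser1.Navigate("google.com");
// Note: Navigate() returns before the page has finished loading, so the code
// below should run once the document is complete (for example, from the
// browser's DocumentCompleted event handler).
GeckoHtmlElement element = null;
var geckoDomElement = geckoWebBrowser1.Document.DocumentElement;
if (geckoDomElement is GeckoHtmlElement)
{
    element = (GeckoHtmlElement)geckoDomElement;
    var innerHtml = element.InnerHtml;
    // Save the rendered HTML (after scripts have run) to aaa.html as UTF-8.
    using (FileStream fs = new FileStream("aaa.html", FileMode.Create))
    {
        using (StreamWriter w = new StreamWriter(fs, Encoding.UTF8))
        {
            w.WriteLine(innerHtml);
        }
    }
}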
Upvotes: 0