Reputation: 121
I'm using Delphi's TWebBrowser component to load some web pages that I want to parse, and they use JavaScript (AJAX?) to render the user-visible HTML. The well-documented methods of extracting the HTML from such pages return a bunch of JavaScript rather than what the user sees. There are responses to queries here that go back to 2004, and they all return JavaScript rather than the user-visible HTML. I've seen a couple that suggest alternate ways to access the data, but I have not been able to get any of them to work, nor am I sure how to adapt the code.
My question is: when I load a web page into a TWebBrowser and it is perfectly readable once rendered inside the component, how can I extract the HTML that is ultimately rendered there, rather than the JS code that generates it?
In my case, I'm trying to load a Google Search results page, but I've heard this is also an issue on lots of news sites like the Wall Street Journal, WAPO, and NYTimes.
var
  url: string;
  d: OleVariant;
begin
  // Enter something like "dentist in baltimore" in a Google search,
  // then copy the contents of the ADDRESS field that it generates and
  // paste it here:
  url := '... paste URL Google generates here ...';
  WebBrowser1.Navigate2( url, 0 {nav_flags} );
  // I have an OnNavigateComplete2 handler here, but I'm guessing this works as well
  d := WebBrowser1.Document;
  memo1.Lines.Text := d.documentElement.outerHTML;
end;
The problem is that the memo contains ... and it's just a bunch of JavaScript in the HEAD. Nothing there resembles what is visible in the TWebBrowser or in the browser window that this search actually displays to the user.
Upvotes: 1
Views: 308
Reputation: 121
Someone in another forum suggested it's a timing issue, and to replace the OnNavigateComplete2 handler that I'm using with OnDocumentComplete. I had never seen or heard of OnDocumentComplete, nor have I seen it used in any examples. Certainly none of the examples that have been simplified to show everything inline, as though no timing issues could occur, use it.
But it turns out that this was the crux of the problem in this case, not outerHTML: you need to handle an event that's triggered after all of the JavaScript has finished running, and I believed that OnNavigateComplete2 did that. My bad.
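For anyone landing here later, this is roughly what the fix looks like: a minimal sketch, not a definitive implementation. It assumes a form named TForm1 with WebBrowser1 and Memo1 on it, and SHDocVw in the uses clause. The exact parameter list of the event differs between Delphi versions (older ones declare URL as var rather than const), so it's safest to let the IDE generate the handler stub and paste in only the body.

```delphi
// Assumed setup: TForm1 with WebBrowser1: TWebBrowser and Memo1: TMemo;
// SHDocVw in the uses clause; handler assigned to WebBrowser1.OnDocumentComplete.
procedure TForm1.WebBrowser1DocumentComplete(ASender: TObject;
  const pDisp: IDispatch; const URL: OleVariant);
var
  CurBrowser, TopBrowser: IWebBrowser2;
  Doc: OleVariant;
begin
  // OnDocumentComplete also fires for each frame, so only react
  // when the event comes from the top-level browser.
  CurBrowser := pDisp as IWebBrowser2;
  TopBrowser := WebBrowser1.DefaultInterface;
  if CurBrowser <> TopBrowser then
    Exit;

  // Nothing to read if no document has been loaded.
  if not Assigned(WebBrowser1.Document) then
    Exit;

  // Same late-bound access as in the question, but done after the
  // document has finished loading instead of right after Navigate2.
  Doc := WebBrowser1.Document;
  Memo1.Lines.Text := Doc.documentElement.outerHTML;
end;
```

The call to WebBrowser1.Navigate2 stays where it was; only the code that reads outerHTML moves into the handler. If a page keeps changing its DOM through later AJAX calls, this may still capture an intermediate state.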
Upvotes: 1