Reputation: 636
How do I obtain the rendered HTML from a URL?
Let's say I want a program that checks the web for updates in the form of news, schedules and other dynamic content (content that is not available in the HTML source).
How do I get the fully rendered HTML, i.e. the complete document as it would appear when read through a browser?
The following example is an example of the dynamic page:
As always with dynamic sites, the above text is nowhere to be found in the source code; it is only visible through a browser.
Of course, I can download the HTML page using a WebClient and DownloadString("http://www.example.com"), but that will only give me the source page, i.e. the static text.
I want to get the final document, let's say after JavaScript has added its elements and jQuery has finished its setup.
Dim Client As New WebClient
Dim HTML As String = Client.DownloadString("http://www.example.com")
To access and parse more of the HTML, I can also use MSHTML.dll to go through the page element by element.
Dim Client As New WebClient
Dim Data As Stream = Client.OpenRead(New Uri("http://example.com"))
Dim Reader As New StreamReader(Data)
Dim HTML As String = Reader.ReadToEnd
Dim Document As IHTMLDocument2 = DirectCast(New mshtml.HTMLDocument(), IHTMLDocument2)
Document.write(HTML)
Dim Elements As IHTMLElementCollection = Document.all
For Each Element As IHTMLElement In Elements
    ' Here I can access things like the element's id, tag, innerHTML and so forth
Next
But neither of these will give me the actual rendered document.
I could create a WebBrowser control, navigate to the URL and access the page's content through that, but if possible, that's not the way I want to do it.
Upvotes: 1
Views: 2436
Reputation: 3082
For web pages that load content dynamically, you have to discover the URLs that the page's script calls to fetch that content. Use a tool like Fiddler to see the URLs. Once you have that information, use WebClient to fetch the content directly.
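As a rough illustration of that last step, here is a minimal sketch. The endpoint URL is hypothetical and stands in for whatever request shows up in Fiddler for your page; the X-Requested-With header is also an assumption, since some servers only answer requests that look like the page's own AJAX calls and others ignore it entirely.

```vb
Imports System.Net
Imports System.Text

Module FetchDynamicContent
    Sub Main()
        ' Hypothetical endpoint: replace with the URL observed in Fiddler.
        Dim endpoint As String = "http://www.example.com/api/news"

        Using client As New WebClient()
            ' Assumption: the server may expect the header that jQuery adds to AJAX requests.
            client.Headers.Add("X-Requested-With", "XMLHttpRequest")
            client.Encoding = Encoding.UTF8

            ' The response is typically JSON, XML or an HTML fragment rather than a full page.
            Dim payload As String = client.DownloadString(endpoint)
            Console.WriteLine(payload)
        End Using
    End Sub
End Module
```

What comes back is the raw data the script would have inserted into the page, so it still needs to be parsed according to whatever format the endpoint returns.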
Upvotes: 4