trevbet

Reputation: 147

MSXML2.XMLHTTP page request: How do you make sure you get ALL of the final HTML code?

I've used this simple subroutine for loading HTML documents from the web for some time now with no problems:

Function GetSource(sURL As String) As Variant

' Purpose:   To obtain the HTML text of a web page
' Receives:  The URL of the web page
' Returns:   The HTML text of the web page in a variant

Dim oXHTTP As Object

Set oXHTTP = CreateObject("MSXML2.XMLHTTP")
oXHTTP.Open "GET", sURL, False
oXHTTP.send
GetSource = oXHTTP.responseText
Set oXHTTP = Nothing

End Function

but I've run into a situation where it loads only part of a page most of the time (not always -- sometimes it returns all of the expected HTML code). If you save the rendered HTML of the page to another file on the web from a browser, the function always reads that saved copy with no problem.

I'm guessing that the issue is timing -- that the dynamic page registers "done" while a script is still filling in details. Sometimes it completes in time, other times it doesn't.

Has anyone ever encountered this behavior before and surmounted it? It seems that there should be a way of capturing, via the MSXML2.XMLHTTP object, exactly what you'd get if you went to the page in a browser and chose the save-to-HTML option.
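One quick way to test the timing theory is to fetch the same URL repeatedly and log the response length -- if the truncation is intermittent, the lengths should vary between runs. This is just a diagnostic sketch using the GetSource function above (the attempt count of 5 is arbitrary):

```vba
' Diagnostic sketch: call GetSource several times against the same page
' and print the length of each response to the Immediate window.
' Varying lengths would suggest the response is being cut short
' inconsistently rather than always returning the same partial HTML.
Sub TestResponseLength()
    Dim i As Long
    For i = 1 To 5
        Debug.Print "Attempt " & i & ": " & _
            Len(GetSource("http://www.tiff.net/festivals/thefestival/programmes/specialpresentations/mr-turner"))
    Next i
End Sub
```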

If you'd like to see the behavior for yourself, here's a sample of a page that doesn't load consistently:

http://www.tiff.net/festivals/thefestival/programmes/specialpresentations/mr-turner

and here's a saved HTML file of that same page:

http://tofilmfest.ca/2014/film/fest/Mr_Turner.htm

Is there any known workaround for this?

Upvotes: 0

Views: 3321

Answers (2)

trevbet

Reputation: 147

Following Alex's suggestion, here's how to do it without a brute force fixed delay:

Function GetHTML(ByVal strURL As String) As Variant
  ' Requires references to Microsoft Internet Controls (SHDocVw)
  ' and Microsoft HTML Object Library (MSHTML)
  Dim oIE As InternetExplorer
  Dim hElm As IHTMLElement
  Set oIE = New InternetExplorer
  oIE.Navigate strURL
  ' Wait until IE reports the page (scripts included) has finished loading
  Do While oIE.Busy Or oIE.ReadyState <> READYSTATE_COMPLETE
     DoEvents
  Loop
  Set hElm = oIE.Document.all.tags("html").Item(0)
  GetHTML = hElm.outerHTML
  oIE.Quit   ' close the hidden IE instance so it doesn't linger
  Set oIE = Nothing
  Set hElm = Nothing
End Function
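If the early-bound types above aren't available, a late-bound variant is possible -- this is a sketch of an alternative (not part of the original answer) that needs no project references, using CreateObject and the literal value 4 in place of the READYSTATE_COMPLETE constant:

```vba
' Late-bound sketch: same readyState-polling technique without the
' SHDocVw/MSHTML references. GetHTMLLateBound is a hypothetical name.
Function GetHTMLLateBound(ByVal strURL As String) As Variant
    Dim oIE As Object
    Set oIE = CreateObject("InternetExplorer.Application")
    oIE.Navigate strURL
    Do While oIE.Busy Or oIE.ReadyState <> 4   ' 4 = READYSTATE_COMPLETE
        DoEvents
    Loop
    GetHTMLLateBound = oIE.Document.documentElement.outerHTML
    oIE.Quit
    Set oIE = Nothing
End Function
```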

Upvotes: 1

trevbet

Reputation: 147

I found a workaround that gives me what I want. I control Internet Explorer programmatically and invoke a three-second delay after I tell it to navigate to a page, to allow the content to finish loading. Then I extract the HTML code using an IHTMLElement from Microsoft's HTML library. It's not pretty, but it retrieves all of the HTML code for every page I've tried it with. If anybody has a better way of accomplishing the same end, feel free to show off.

Function testbrowser() As Variant
   Dim oIE As InternetExplorer
   Dim hElm As IHTMLElement
   Set oIE = New InternetExplorer
   oIE.Height = 600
   oIE.Width = 800
   oIE.Visible = True
   oIE.Navigate "http://www.tiff.net/festivals/thefestival/programmes/galapresentations/the-riot-club"
   Call delay(3)
   Set hElm = oIE.Document.all.tags("html").Item(0)
   testbrowser = hElm.outerHTML
End Function

Sub delay(ByVal secs As Integer)
   Dim datLimit As Date
   datLimit = DateAdd("s", secs, Now())
   While Now() < datLimit
      DoEvents   ' yield so the application stays responsive while waiting
   Wend
End Sub

Upvotes: 2
