Full HTML of page not returned by XML response

Question

I'm trying to scrape stats for Tennis players from this website into an excel document for easier viewing: http://www.tennisabstract.com/cgi-bin/player.cgi?p=DominicThiem (example player).

I've done a few scrapes before with VBA, so I'm not totally new to scraping, but I'm aware that this problem is more likely due to my lack on knoweledge than an inherent problem with the applications I'm using! When I debug.print the inner HTML of the above webpage, the HTML is not the same as when I inspect element on the page. This is the full HTML returned by the scrape, which clearly is not complete:




This site is slow and ugly in Internet Explorer. In Chrome or Firefox, it is much faster and a bit less ugly.
 
HeavyTopspin.com | Twitter 





tennisabstract.com
 

 
 

   Singles Results    Head-to-Heads    Event Records  
 Doubles Results    
 

  
 

 




 
 
       Stats:  Overview | Serve | Return | Raw

I am trying to scrape the 'main' table on the bottom half of the page, that shows the player's most recent matches (id = "matches", 6th line from the bottom of the above HTML). When I inspect element on the web page itself, the table looks like it should be very easy to scrape, but my scrape isn't returning the full HTML so I can't reference anything in my code.

I have considered that this may be due to my XML not fully loading, so have tried this, which returns the same HTML.

Here is the code I am using so far:

Sub TennisStats()

    Dim XMLPage As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument

    Application.ScreenUpdating = False

    player1 = ThisWorkbook.Worksheets(1).Range("B1").Value
    player1 = Replace(player1, " ", "")

    With XMLPage
        .Open "GET", "http://www.tennisabstract.com/cgi-bin/player.cgi?p=" & player1
        .send

        HTMLDoc.body.innerHTML = .responseText

    End With

    Debug.Print HTMLDoc.body.innerHTML

End Sub

QHarr · Accepted Answer

You can't use XMLHTTP as there isn't time for the data to load.

Also, it didn't seem to want to work for me using IE, via Internet Controls, so I switched to Chrome using selenium basic. If you install selenium basic you also need to go VBE > Tools > References > Add references to Selenium Type Library. Selenium supports variety of browsers including FireFox, Chrome, IE and Opera.

If using Chrome ensure you have the latest Chrome install as well as latest ChromeDriver install and that the ChromeDriver.exe is in a folder which is on the environmental path, unless you wish to pass the path to the executable as an argument to selenium.

Option Explicit
Public Sub GetTable()
    Dim d As WebDriver, html As HTMLDocument, hTable As HTMLTable
    Set d = New ChromeDriver: Set html = New HTMLDocument
    Const URL = "http://www.tennisabstract.com/cgi-bin/player.cgi?p=DominicThiem"

    With d
        .Start "Chrome"
        .get URL
        html.body.innerHTML = .PageSource
        Set hTable = html.getElementById("matches")
        WriteTable hTable, 1, ThisWorkbook.Worksheets("Sheet1")
        .Quit
    End With
End Sub
Public Sub WriteTable(ByVal hTable As HTMLTable, Optional ByVal startRow As Long = 1, Optional ByVal ws As Worksheet)
    If ws Is Nothing Then Set ws = ActiveSheet
    Dim tRow As Object, tCell As Object, tr As Object, td As Object, r As Long, c As Long
    r = startRow
    With ws
        Dim headers As Object, header As Object, columnCounter As Long
        Set headers = hTable.getElementsByTagName("th")
        For Each header In headers
            columnCounter = columnCounter + 1
            .Cells(startRow + 1, columnCounter) = header.innerText
        Next header
        Set tRow = hTable.getElementsByTagName("tr")
        For Each tr In tRow
            r = r + 1
            Set tCell = tr.getElementsByTagName("td")
            c = 1
            For Each td In tCell
                Select Case c
                Case 16
                    .Cells(r, c).Value = "'" & td.innerText
                Case Else
                    .Cells(r, c).Value = td.innerText
                End Select
                c = c + 1
            Next td
        Next tr
    End With
End Sub

Sample webpage view:

Sample selenium output:

Full HTML of page not returned by XML response

Answers (1)

Related Questions