bm13563
bm13563

Reputation: 698

Full HTML of page not returned by XML response

I'm trying to scrape stats for Tennis players from this website into an excel document for easier viewing: http://www.tennisabstract.com/cgi-bin/player.cgi?p=DominicThiem (example player).

I've done a few scrapes before with VBA, so I'm not totally new to scraping, but I'm aware that this problem is more likely due to my lack on knoweledge than an inherent problem with the applications I'm using! When I debug.print the inner HTML of the above webpage, the HTML is not the same as when I inspect element on the page. This is the full HTML returned by the scrape, which clearly is not complete:

<TABLE width=1280>
<TBODY>
<TR>
<TD align=left>This site is slow and ugly in Internet Explorer. In <A href="https://www.google.com/intl/en/chrome/browser/">Chrome</A> or <A href="http://www.mozilla.org/en-US/firefox/new/">Firefox</A>, it is much faster and a bit less ugly.<BR>&nbsp;</TD>
<TD align=right><A href="http://www.tennisabstract.com/blog">HeavyTopspin.com</A> | <A href="https://twitter.com/#!/tennisabstract">Twitter</A> </TD></TR></TBODY></TABLE>
<DIV id=header>
<TABLE width=1240>
<TBODY>
<TR>
<TD class=headline align=left>
<H1><SPAN class=blackheader><A href="http://www.tennisabstract.com/">tennis<SPAN style="COLOR: blue">abstract</SPAN>.com</SPAN></H1></A></TD>
<TD id=playersearch style="VERTICAL-ALIGN: top" align=right>&nbsp;</TD></TR>
<TR>
<TD>&nbsp;</TD>
<TD>&nbsp;</TD></TR>
<TR>
<TD style="VERTICAL-ALIGN: top" align=left><SPAN id=bio>&nbsp;</SPAN> <SPAN id=tabResults class=tabview style="BACKGROUND-COLOR: #e6eeee">&nbsp;<B>Singles Results</B>&nbsp;</SPAN>&nbsp; <SPAN id=tabHead class="tablink tabview" style="BACKGROUND-COLOR: #e6eeee">&nbsp;<B>Head-to-Heads</B>&nbsp;</SPAN>&nbsp; <SPAN id=tabEvents class="tablink tabview" style="BACKGROUND-COLOR: #e6eeee">&nbsp;<B>Event Records</B>&nbsp;</SPAN> <BR><SPAN id=tabDubs class="tablink tabview" style="POSITION: relative; TOP: 5px; BACKGROUND-COLOR: #e6eeee">&nbsp;<B>Doubles Results</B>&nbsp;</SPAN><SPAN id=tabDubsSpacer>&nbsp;&nbsp;</SPAN> <SPAN id=shotsHere></SPAN></TD>
<TD id=wonloss style="VERTICAL-ALIGN: top" align=right>&nbsp;</TD></TR>
<TR>
<TD id=tabmenu style="VERTICAL-ALIGN: top" align=left>&nbsp; </TD>
<TD>&nbsp;</TD></TR>
<TR>
<TD id=playernews colSpan=2>&nbsp;</TD></TR></TR></TBODY></TABLE></DIV>
<DIV id=main>
<TABLE id=maintable width=1280>
<TBODY>
<TR id=tabletoggles>
<TD>&nbsp;</TD>
<TD id=tablelabel>&nbsp;</TD>
<TD id=abovestats class=abovestats align=right>&nbsp;&nbsp;&nbsp;<SPAN class="revscore likelink"></SPAN> &nbsp;&nbsp;&nbsp;<B>Stats:</B>&nbsp; <SPAN class="statsa stattab">Overview</SPAN><SPAN class=statspacer>&nbsp;|&nbsp;</SPAN><SPAN class="statso stattab">Serve</SPAN>&nbsp;|&nbsp;<SPAN class="statsr stattab likelink">Return</SPAN>&nbsp;|&nbsp;<SPAN class="statsw stattab likelink">Raw</SPAN> </TD></TR>
<TR>
<TD id=footer class=footer>&nbsp;</TD>
<TD id=stats class=stats colSpan=2>
<TABLE id=matches>
<TBODY></TBODY></TABLE></TD></TR>
<TR>
<TD id=belowmenus>&nbsp;<BR>&nbsp;<BR>&nbsp;<BR>&nbsp;<BR>&nbsp;</TD>
<TD id=belowmatches colSpan=2>&nbsp;</TD></TR></TBODY></TABLE></DIV>
<DIV></DIV>

I am trying to scrape the 'main' table on the bottom half of the page, that shows the player's most recent matches (id = "matches", 6th line from the bottom of the above HTML). When I inspect element on the web page itself, the table looks like it should be very easy to scrape, but my scrape isn't returning the full HTML so I can't reference anything in my code.

I have considered that this may be due to my XML not fully loading, so have tried this, which returns the same HTML.

Here is the code I am using so far:

Sub TennisStats()

    Dim XMLPage As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument

    Application.ScreenUpdating = False

    player1 = ThisWorkbook.Worksheets(1).Range("B1").Value
    player1 = Replace(player1, " ", "")

    With XMLPage
        .Open "GET", "http://www.tennisabstract.com/cgi-bin/player.cgi?p=" & player1
        .send

        HTMLDoc.body.innerHTML = .responseText

    End With

    Debug.Print HTMLDoc.body.innerHTML

End Sub

Upvotes: 0

Views: 302

Answers (1)

QHarr
QHarr

Reputation: 84465

You can't use XMLHTTP as there isn't time for the data to load.

Also, it didn't seem to want to work for me using IE, via Internet Controls, so I switched to Chrome using selenium basic. If you install selenium basic you also need to go VBE > Tools > References > Add references to Selenium Type Library. Selenium supports variety of browsers including FireFox, Chrome, IE and Opera.

If using Chrome ensure you have the latest Chrome install as well as latest ChromeDriver install and that the ChromeDriver.exe is in a folder which is on the environmental path, unless you wish to pass the path to the executable as an argument to selenium.

Option Explicit
Public Sub GetTable()
    Dim d As WebDriver, html As HTMLDocument, hTable As HTMLTable
    Set d = New ChromeDriver: Set html = New HTMLDocument
    Const URL = "http://www.tennisabstract.com/cgi-bin/player.cgi?p=DominicThiem"

    With d
        .Start "Chrome"
        .get URL
        html.body.innerHTML = .PageSource
        Set hTable = html.getElementById("matches")
        WriteTable hTable, 1, ThisWorkbook.Worksheets("Sheet1")
        .Quit
    End With
End Sub
Public Sub WriteTable(ByVal hTable As HTMLTable, Optional ByVal startRow As Long = 1, Optional ByVal ws As Worksheet)
    If ws Is Nothing Then Set ws = ActiveSheet
    Dim tRow As Object, tCell As Object, tr As Object, td As Object, r As Long, c As Long
    r = startRow
    With ws
        Dim headers As Object, header As Object, columnCounter As Long
        Set headers = hTable.getElementsByTagName("th")
        For Each header In headers
            columnCounter = columnCounter + 1
            .Cells(startRow + 1, columnCounter) = header.innerText
        Next header
        Set tRow = hTable.getElementsByTagName("tr")
        For Each tr In tRow
            r = r + 1
            Set tCell = tr.getElementsByTagName("td")
            c = 1
            For Each td In tCell
                Select Case c
                Case 16
                    .Cells(r, c).Value = "'" & td.innerText
                Case Else
                    .Cells(r, c).Value = td.innerText
                End Select
                c = c + 1
            Next td
        Next tr
    End With
End Sub

Sample webpage view:

enter image description here


Sample selenium output:

Upvotes: 2

Related Questions