Reputation: 698
I'm trying to scrape stats for Tennis players from this website into an excel document for easier viewing: http://www.tennisabstract.com/cgi-bin/player.cgi?p=DominicThiem (example player).
I've done a few scrapes before with VBA, so I'm not totally new to scraping, but I'm aware that this problem is more likely due to my lack on knoweledge than an inherent problem with the applications I'm using! When I debug.print
the inner HTML of the above webpage, the HTML is not the same as when I inspect element on the page. This is the full HTML returned by the scrape, which clearly is not complete:
<TABLE width=1280>
<TBODY>
<TR>
<TD align=left>This site is slow and ugly in Internet Explorer. In <A href="https://www.google.com/intl/en/chrome/browser/">Chrome</A> or <A href="http://www.mozilla.org/en-US/firefox/new/">Firefox</A>, it is much faster and a bit less ugly.<BR> </TD>
<TD align=right><A href="http://www.tennisabstract.com/blog">HeavyTopspin.com</A> | <A href="https://twitter.com/#!/tennisabstract">Twitter</A> </TD></TR></TBODY></TABLE>
<DIV id=header>
<TABLE width=1240>
<TBODY>
<TR>
<TD class=headline align=left>
<H1><SPAN class=blackheader><A href="http://www.tennisabstract.com/">tennis<SPAN style="COLOR: blue">abstract</SPAN>.com</SPAN></H1></A></TD>
<TD id=playersearch style="VERTICAL-ALIGN: top" align=right> </TD></TR>
<TR>
<TD> </TD>
<TD> </TD></TR>
<TR>
<TD style="VERTICAL-ALIGN: top" align=left><SPAN id=bio> </SPAN> <SPAN id=tabResults class=tabview style="BACKGROUND-COLOR: #e6eeee"> <B>Singles Results</B> </SPAN> <SPAN id=tabHead class="tablink tabview" style="BACKGROUND-COLOR: #e6eeee"> <B>Head-to-Heads</B> </SPAN> <SPAN id=tabEvents class="tablink tabview" style="BACKGROUND-COLOR: #e6eeee"> <B>Event Records</B> </SPAN> <BR><SPAN id=tabDubs class="tablink tabview" style="POSITION: relative; TOP: 5px; BACKGROUND-COLOR: #e6eeee"> <B>Doubles Results</B> </SPAN><SPAN id=tabDubsSpacer> </SPAN> <SPAN id=shotsHere></SPAN></TD>
<TD id=wonloss style="VERTICAL-ALIGN: top" align=right> </TD></TR>
<TR>
<TD id=tabmenu style="VERTICAL-ALIGN: top" align=left> </TD>
<TD> </TD></TR>
<TR>
<TD id=playernews colSpan=2> </TD></TR></TR></TBODY></TABLE></DIV>
<DIV id=main>
<TABLE id=maintable width=1280>
<TBODY>
<TR id=tabletoggles>
<TD> </TD>
<TD id=tablelabel> </TD>
<TD id=abovestats class=abovestats align=right> <SPAN class="revscore likelink"></SPAN> <B>Stats:</B> <SPAN class="statsa stattab">Overview</SPAN><SPAN class=statspacer> | </SPAN><SPAN class="statso stattab">Serve</SPAN> | <SPAN class="statsr stattab likelink">Return</SPAN> | <SPAN class="statsw stattab likelink">Raw</SPAN> </TD></TR>
<TR>
<TD id=footer class=footer> </TD>
<TD id=stats class=stats colSpan=2>
<TABLE id=matches>
<TBODY></TBODY></TABLE></TD></TR>
<TR>
<TD id=belowmenus> <BR> <BR> <BR> <BR> </TD>
<TD id=belowmatches colSpan=2> </TD></TR></TBODY></TABLE></DIV>
<DIV></DIV>
I am trying to scrape the 'main' table on the bottom half of the page, that shows the player's most recent matches (id = "matches", 6th line from the bottom of the above HTML). When I inspect element on the web page itself, the table looks like it should be very easy to scrape, but my scrape isn't returning the full HTML so I can't reference anything in my code.
I have considered that this may be due to my XML not fully loading, so have tried this, which returns the same HTML.
Here is the code I am using so far:
Sub TennisStats()
Dim XMLPage As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Application.ScreenUpdating = False
player1 = ThisWorkbook.Worksheets(1).Range("B1").Value
player1 = Replace(player1, " ", "")
With XMLPage
.Open "GET", "http://www.tennisabstract.com/cgi-bin/player.cgi?p=" & player1
.send
HTMLDoc.body.innerHTML = .responseText
End With
Debug.Print HTMLDoc.body.innerHTML
End Sub
Upvotes: 0
Views: 302
Reputation: 84465
You can't use XMLHTTP as there isn't time for the data to load.
Also, it didn't seem to want to work for me using IE, via Internet Controls, so I switched to Chrome using selenium basic. If you install selenium basic you also need to go VBE > Tools > References > Add references to Selenium Type Library. Selenium supports variety of browsers including FireFox, Chrome, IE and Opera.
If using Chrome ensure you have the latest Chrome install as well as latest ChromeDriver install and that the ChromeDriver.exe is in a folder which is on the environmental path, unless you wish to pass the path to the executable as an argument to selenium.
Option Explicit
Public Sub GetTable()
Dim d As WebDriver, html As HTMLDocument, hTable As HTMLTable
Set d = New ChromeDriver: Set html = New HTMLDocument
Const URL = "http://www.tennisabstract.com/cgi-bin/player.cgi?p=DominicThiem"
With d
.Start "Chrome"
.get URL
html.body.innerHTML = .PageSource
Set hTable = html.getElementById("matches")
WriteTable hTable, 1, ThisWorkbook.Worksheets("Sheet1")
.Quit
End With
End Sub
Public Sub WriteTable(ByVal hTable As HTMLTable, Optional ByVal startRow As Long = 1, Optional ByVal ws As Worksheet)
If ws Is Nothing Then Set ws = ActiveSheet
Dim tRow As Object, tCell As Object, tr As Object, td As Object, r As Long, c As Long
r = startRow
With ws
Dim headers As Object, header As Object, columnCounter As Long
Set headers = hTable.getElementsByTagName("th")
For Each header In headers
columnCounter = columnCounter + 1
.Cells(startRow + 1, columnCounter) = header.innerText
Next header
Set tRow = hTable.getElementsByTagName("tr")
For Each tr In tRow
r = r + 1
Set tCell = tr.getElementsByTagName("td")
c = 1
For Each td In tCell
Select Case c
Case 16
.Cells(r, c).Value = "'" & td.innerText
Case Else
.Cells(r, c).Value = td.innerText
End Select
c = c + 1
Next td
Next tr
End With
End Sub
Sample webpage view:
Sample selenium output:
Upvotes: 2