Laymo
Laymo

Reputation: 43

Excel Pulling multiple Tables From a Website

I am working on a project to run some analytical models on NFL player stats. I have some code below that another user passed along to me. This code takes a list of links that I have on Sheet1, which is named "PlayerList", and creates a new tab for each player and pulls in their passing stats. All of the links are to Pro Football Reference. I am able to change this code to pull all necessary data for all positions other than quarterback. For the QBs I want to pull the passing stats table as well as the rushing and receiving stats table. Any help would be greatly appreciated. For reference here a few sample links:

https://www.pro-football-reference.com/players/R/RodgAa00.htm https://www.pro-football-reference.com/players/B/BreeDr00.htm

Below is the code:

Option Explicit
Public Sub GetInfo()
    Di  If InStr(links(link, 1), "https://") > 0 Then
            Set html = GetHTMLDoc(links(link, 1))
            Set hTable = html.getElementById("passing")
            If Not hTable Is Nothing Then
                playerName = GetNameAbbr(links(link, 1))
                Set ws = AddPlayerSheet(playerName)
                WriteTableToSheet hTable, ws
                FixTable ws
            End If
        End If
    Next
    Application.ScreenUpdating = True
End Sub

Public Function GetHTMLDoc(ByVal url As String) As HTMLDocument
    Dim sResponse As String, html As New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With
    sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
    html.body.innerHTML = sResponse
    Set GetHTMLDoc = html
End Function

Public Sub WriteTableToSheet(ByVal hTable As HTMLTable, ByVal ws As Worksheet)
    Dim x As Long, y As Long
    With hTable
        For x = 0 To .Rows.Length - 1
            For y = 0 To .Rows(x).Cells.Length - 1
                If y = 6 Or y = 7 Then
                    ws.Cells(x + 4, y + 1).Value = Chr$(39) & .Rows(x).Cells(y).innerText
                Else
                    ws.Cells(x + 4, y + 1).Value = .Rows(x).Cells(y).innerText
                End If
            Next y
        Next x
    End With
End Sub

Public Function GetNameAbbr(ByVal url As String)
    Dim tempArr() As String
    tempArr = Split(url, "/")
    GetNameAbbr = Left$(tempArr(UBound(tempArr)), 6)
End Function

Public Function AddPlayerSheet(ByVal playerName As String) As Worksheet
    Dim ws As Worksheet
    If SheetExists(playerName) Then
        Application.DisplayAlerts = False
        ThisWorkbook.Worksheets(playerName).Delete
        Application.DisplayAlerts = True
    End If
    Set ws = ThisWorkbook.Worksheets.Add
    ws.Name = playerName
    Set AddPlayerSheet = ws
End Function

Public Function SheetExists(ByVal playerName As String) As Boolean
    SheetExists = Evaluate("ISREF('" & playerName & "'!A1)")
End Function

Public Sub FixTable(ByVal ws As Worksheet)
    Dim found As Range, numSummaryRows As Long
    With ws
        Set found = .Columns("A").Find("Career")
        If found Is Nothing Then Exit Sub
        numSummaryRows = .Cells(.Rows.Count, "A").End(xlUp).Row - found.Row
        numSummaryRows = IIf(numSummaryRows = 0, 1, numSummaryRows + 1)
        Debug.Print found.Offset(, 1).Resize(numSummaryRows, 30).Address, ws.Name
        found.Offset(, 1).Resize(numSummaryRows, 30).Copy found.Offset(, 2)
        found.Offset(, 1).Resize(numSummaryRows, 1).ClearContents
    End With
End Subm html As New HTMLDocument, links(), link As Long, wsSourceSheet As Worksheet
    Dim hTable As HTMLTable, ws As Worksheet, playerName As String
    Set wsSourceSheet = ThisWorkbook.Worksheets("PlayerList")
    Application.ScreenUpdating = False
    With wsSourceSheet
        links = .Range("C2:C" & .Cells(.Rows.Count, "C").End(xlUp).Row).Value
    End With
    For link = LBound(links, 1) To UBound(links, 1)

Upvotes: 2

Views: 1652

Answers (2)

QHarr
QHarr

Reputation: 84475

Yes there is a reason for doing this with VBA. In fact at least five.....

  1. You don't manually have to set up it up for all the links, which if you have a very long list means you would end up having to turn to automation anyway;
  2. On a related theme, powerquery has limitations on how many connections it can support and with NFL player lists you can easily go way beyond what is supported and end up, even when at the max number of connections allowed, with a workbook that crashes or grinds to a halt (I have been there!);
  3. Both tables are not always present so the below has error handling to deal with that;
  4. You get your player named sheets as before, and again error handling for if sheet already present;
  5. Not all versions of powerquery have the nice interface which will allow you to select all the tables individually for these pages. My version of Excel 2016 basically offers only to select the entire page. In that case you have more data than you need and a slowed down process.

Whilst there may be ways to handle this with inbuilt tools, I love me a bit of powerquery, it is no longer "out of the box", but requires knowing how to code in M to some extent and/or reverting to using some VBA anyway.

If you tie this to a button on a sheet you can easily press to refresh when you want, link it to a workbook_open event to refresh on opening, even have windows scheduler open the workbook and refresh at certain times (just so you know VBA still got your back! Though maybe with a little help from my friends ♫ aka Windows).


It seems XHR is just a little too fast for the lower tables on each page, but do not despair, you could use Internet Explorer, with a short delay to ensure the Rushing & Receiving table is populated, or, as I have, use Selenium to automate the browser (I have used Chrome but Internet Explorer is possible). Although this is slower than XHR, we can be a little more efficient by running a headless browser instance.


Here you go with VBA which will give you the different tabs as you go and select only those tables required. Based on links in at C2 on sheet1.

Option Explicit
Public Sub GetInfo()
    Dim d As New ChromeDriver

    Dim html As New HTMLDocument, links(), link As Long, wsSourceSheet As Worksheet, clipboard As Object
    Dim hTablePass As HTMLTable, hTableRushReceive As HTMLTable, ws As Worksheet, playerName As String
    Set wsSourceSheet = ThisWorkbook.Worksheets("Sheet1") '<change to sheet containing links
    Application.ScreenUpdating = False
    With wsSourceSheet
        If .Cells(.Rows.Count, "C").End(xlUp).Row = 2 Then
            ReDim links(1 To 1, 1 To 1): links(1, 1) = .Range("C2")
        Else
            links = .Range("C2:C" & .Cells(.Rows.Count, "C").End(xlUp).Row).Value
        End If
    End With
    For link = LBound(links, 1) To UBound(links, 1)
        If InStr(links(link, 1), "https://") > 0 Then
            With d
                 .AddArgument "--headless"
                .get links(link, 1)
                html.body.innerHTML = .PageSource
                Set hTablePass = html.querySelector("#all_passing #passing")
                Set hTableRushReceive = html.querySelector("#all_rushing_and_receiving #rushing_and_receiving")
                playerName = GetNameAbbr(links(link, 1))
                Set ws = AddPlayerSheet(playerName)
                Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
                If Not hTablePass Is Nothing Then
                    clipboard.SetText Replace$(Replace$(hTablePass.outerHTML, "--></DIV>", vbNullString), "<!--", vbNullString)
                    clipboard.PutInClipboard
                    ws.Cells(GetLastRow(ws, 1), 1).PasteSpecial
                End If
                If Not hTableRushReceive Is Nothing Then
                    clipboard.SetText hTableRushReceive.outerHTML
                    clipboard.PutInClipboard
                    ws.Cells(GetLastRow(ws, 1) + 2, 1).PasteSpecial
                End If
            End With
        End If
    Next
    Application.ScreenUpdating = True
End Sub

Public Function GetNameAbbr(ByVal url As String) As String
    Dim tempArr() As String
    tempArr = Split(url, "/")
    GetNameAbbr = Left$(tempArr(UBound(tempArr)), 6)
End Function

Public Function AddPlayerSheet(ByVal playerName As String) As Worksheet
    Dim ws As Worksheet
    If SheetExists(playerName) Then
        Application.DisplayAlerts = False
        ThisWorkbook.Worksheets(playerName).Delete
        Application.DisplayAlerts = True
    End If
    Set ws = ThisWorkbook.Worksheets.Add
    ws.Name = playerName
    Set AddPlayerSheet = ws
End Function

Public Function SheetExists(ByVal playerName As String) As Boolean '<== *@Rory
    SheetExists = Evaluate("ISREF('" & playerName & "'!A1)")
End Function

Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
    With ws
        GetLastRow = .Cells(.Rows.Count, columnNumber).End(xlUp).Row
    End With
End Function

Image


References:

  1. Microsoft HTML Object Library
  2. Selenium Type Library

Selenium basic download:

  1. https://github.com/florentbr/SeleniumBasic

*Function adapted from @Rory

Upvotes: 2

ashleedawg
ashleedawg

Reputation: 21657

Is there a reason you need to do this with VBA? Excel is quite capable of importing well-organized data such as the [several] tables on that page.

Under the Data tab, click From Web and then enter the Website URL.

img
Click images to enlarge

Next you will choose the table(s) that you want. Don't go nuts - only get what you need, but you can choose more than one tables by enabling the checkbox.

img

It can take a few long minutes to parse and organize all the data on the page...

img

Once you're back at the worksheet you'll see the queries on the right side. Right-click a query and choose Load To..., then choose Table and a location for the table data. There are a ton of other properties that you can customize; there are tutorials describing what you can do.

img

More things to customize are hidden in two ribbon tabs that only appear when you click on a table, Design and Query.

img

I think there's also a way to just create a list of players and then to use the Advanced option when entering the URL to allow you to dynamically choose any player you want, while only adding the tables once... but I've never quite figured that part out yet.

I'm not a sports fan, but I assume the data will be changing throughout the season, and an advantage of using tables like this is that once you set up your worksheet how you want it, there are settings you can choose to auto-update every time you open the workbook, or on schedule, or manually, or never; whatever is appropriate.

Google "Excel web query" to find out more about the plethora of options available to you when using queries (aka: "Get & Transform") to extract and organize your data.

Perhaps this could be an alternative to consider instead of coding functionality that's already built-in to Excel.

Good luck, and "Go Sports!"

img

Upvotes: 4

Related Questions