Gggggggg

Reputation: 31

Web scraping a table in VBA

I have successfully navigated to the page containing the table I wish to extract data from. To be upfront: this is the first time I have tried something like this, and I am really chuffed with myself for getting this far. I navigated to the website, entered the username and password, and then navigated to the page containing the table I am interested in.

I am now trying to extract data from the table, but I get the following error:

[Screenshot of the error message]

My code is as follows:

'==============================================================
'
Public IE                   As New SHDocVw.InternetExplorer
'==============================================================
'                           HTML DOCUMENT
'
Public HTMLDoc              As MSHTML.HTMLDocument
'==============================================================
'                           BUTTON COLLECTION
'
Public HTMLButtons          As MSHTML.IHTMLElementCollection
Public HTMLButton           As MSHTML.IHTMLElement
'==============================================================
'                           ATTRIBUTE COLLECTION
'
Public HTMLAs               As MSHTML.IHTMLElementCollection3
Public HTMLA                As MSHTML.IHTMLElement3
'==============================================================
'                           TABLE COLLECTION
'
Public HTMLTables           As MSHTML.IHTMLElementCollection
Public HTMLTable            As MSHTML.IHTMLElement
'==============================================================
'                           TABLE ELEMENTS
'
Public TableBody            As MSHTML.IHTMLElementCollection2
Public TableRows            As MSHTML.IHTMLElementCollection3
Public TableCell            As MSHTML.IHTMLElementCollection4
'==============================================================
Public RowNum               As Long
Public ColNum               As Long
'==============================================================
'
'
Public Sub TableCollection()

Worksheets.Add
RowNum = 1
Set TableBody = HTMLDoc.getElementsByTagName("tbody")
Set TableRows = HTMLDoc.getElementsByTagName("tr")
Set TableCell = HTMLDoc.getElementsByTagName("td")
For Each TableRows In TableBody
    ColNum = 1
    For Each TableCell In TableRows
        Cells(RowNum, ColNum).Value = TableCell.innerText
        ColNum = ColNum + 1
    Next TableCell
RowNum = RowNum + 1
Next TableRows
End Sub

Below are the header row and one data row of the table I am trying to scrape. I have replaced the URLs with WEBADDRESS.

<html><head>
  <title>
    Transaction SpreadSheet for the Current Month to date - April 2020</title>
</head>
<body>
<style>
  td { font-family:arial,verdana,sans-serif;font-size:12px;color:#000000;line-height:16px;}
</style>
<table cellpadding="2">
  <tbody>
  <tr>
    <td>
      <b>Date</b>
    </td>
    <td>
      <b>Reference</b>
    </td>
        <td>
      <b>Item</b>
    </td>
    <td>
      <b>Particulars</b>
    </td>
    <td>
      <b>Buyer</b>
    </td>
        <td>
      <b>Order Id</b>
    </td>
    <td>
      <b>Note</b>
    </td>
    <td>
      <b>Transaction Amount</b>
    </td>
   </tr>
<tr>
  <td>
    04&nbsp;Apr&nbsp;2020</td>
    <td>
    239137532</td>  
  <td>
    <a href="https://WEBADDRESS">461619577</a></td>
  <td>
    Success Fee</td>
  <td>
  <a title="User profile for Joe" href="WEBADDRESS">RoySch2510</a>
  </td>
    <td>
  <a href="https://WEBADDRESS" rel="nofollow,noindex">17314294</a>
  </td>
  <td>
    </td>   
  <td>
    -62.55</td>
  </tr>
<tr>

Please advise what I am doing wrong.

OK, here is all my code; I hope it gives more insight:

Option Explicit

Public Sub GetHTMLDocument()
'===========================================================================
'                         ESTABLISH PUBLIC VARIABLES
'
Call PublicHTMLVariables
'===========================================================================
'                              NAVIGATE TO IE
'
Call NavigateToIE("https://old.bidorbuy.co.za/jsp/login/UserLogin.jsp")
'===========================================================================
'                                   LOGIN
'
Call LoginToWebsite("JoeCam9517", "********")
'===========================================================================
'                           NAVIGATE TO 1st PAGE
'
Call NavigateToFirstPage
'===========================================================================
'                      NAVIGATE TO ACCOUNT HISTORY PAGE
'
Call NavigateToAccountsPage
'===========================================================================
'               CHANGE THE DATE RANGE FOR TRANSACTION SELECTION
'
'Call ChangeDateRange
'===========================================================================
'                      NAVIGATE TO ACCOUNT TABLE PAGE
'
Call NavigateToTablesPage
'===========================================================================
'                     COLLECT TABLE ELEMENTS TO WORKSHEET
'
Call TableCollection
'===========================================================================

MsgBox "Pause"
'                       MORE CODE STILL TO BE DEVELOPED

End Sub

PUBLIC VARIABLES

Option Explicit
'==============================================================
'
Public IE                   As New SHDocVw.InternetExplorer
'==============================================================
'                           HTML DOCUMENT
'
Public HTMLDoc              As MSHTML.HTMLDocument
'==============================================================
'                           HTML ELEMENTS
'
Public HTMLInput            As MSHTML.IHTMLElement
Public FromDay              As MSHTML.IHTMLElement
Public FromYearMonth        As MSHTML.IHTMLElement
Public ToDay                As MSHTML.IHTMLElement
'==============================================================
'                           BUTTON COLLECTION
'
Public HTMLButtons          As MSHTML.IHTMLElementCollection
Public HTMLButton           As MSHTML.IHTMLElement
'==============================================================
'                           ATTRIBUTE COLLECTION
'
Public HTMLAs               As MSHTML.IHTMLElementCollection3
Public HTMLA                As MSHTML.IHTMLElement3
'==============================================================
'                           TABLE COLLECTION
'
Public HTMLTable            As MSHTML.IHTMLElement
Public HTMLTableRows        As MSHTML.IHTMLElementCollection
Public HTMLTableCells       As MSHTML.IHTMLElementCollection
'==============================================================
'                           DATE ELEMENTS
'
Public ToYearMonth          As MSHTML.IHTMLElement
'==============================================================
'                           TABLE ELEMENTS
'
'Public TableBody            As MSHTML.IHTMLElementCollection2
'Public TableRows            As MSHTML.IHTMLElementCollection3
'Public TableCell            As MSHTML.IHTMLElementCollection4
'==============================================================
Public H                    As Integer
Public RowNum               As Long
Public ColNum               As Long
'==============================================================

Public Sub PublicHTMLVariables()

End Sub

Navigate to Webpage

Option Explicit

Public Sub NavigateToIE(Destination As String)
IE.Visible = True
IE.Navigate Destination
Do Until IE.ReadyState = 4
    DoEvents
Loop
End Sub

PREPARE TO LOGIN

Option Explicit
Public Sub LoginToWebsite(UserID As String, PassWord As String)
Set HTMLDoc = IE.Document
Set HTMLInput = HTMLDoc.getElementById("username")
    HTMLInput.Value = UserID
Set HTMLInput = HTMLDoc.getElementById("password")
    HTMLInput.Value = PassWord
End Sub

NAVIGATE TO FIRST PAGE

Option Explicit
'===========================================================================
'
'
Public Sub NavigateToFirstPage()
Set HTMLButtons = HTMLdoc.getElementsByTagName("button")
HTMLButtons(3).Click
Do While IE.ReadyState = 4: DoEvents: Loop
Do Until IE.ReadyState = 4: DoEvents: Loop
End Sub

NAVIGATE TO ACCOUNT HISTORY PAGE

Option Explicit

'===========================================================================
'                      NAVIGATE TO ACCOUNT HISTORY PAGE
'
Public Sub NavigateToAccountsPage()
H = 0
Set HTMLAs = HTMLdoc.getElementsByTagName("a")
For Each HTMLA In HTMLAs
    If HTMLA.href = "https://old.bidorbuy.co.za/jsp/fee/UserAccount.jsp" Then
        GoTo ButtonFound
    End If
    H = H + 1
Next HTMLA
ButtonFound:
HTMLAs(H).Click
Do While IE.ReadyState = 4: DoEvents: Loop
Do Until IE.ReadyState = 4: DoEvents: Loop
End Sub

CHANGE THE DATE RANGE - NOT WORKING - I'M GOING TO ASK FOR HELP ON THAT AT A LATER DATE

NAVIGATE TO TABLES PAGE

Option Explicit
 

'=========================================================================
'
'                      NAVIGATE TO ACCOUNT TABLE PAGE
'
Public Sub NavigateToTablesPage()
Set HTMLButtons = HTMLdoc.getElementsByName("DetailSubmit")
HTMLButtons(1).Click
End Sub

AND THAT BRINGS US TO THE PROCEDURE I'M HAVING A PROBLEM WITH

Option Explicit
'===========================================================================
'
'
Public Sub TableCollection()
Worksheets.Add

Dim HTMLdoc         As New HTMLDocument
Dim trow            As Object
Dim tcel            As Object
Dim rowNum          As Long
Dim colNum          As Long

rowNum = 1

For Each trow In HTMLdoc.getElementsByTagName("tbody")(0).getElementsByTagName("tr")
    colNum = 1
    For Each tcel In trow.getElementsByTagName("td")
        Cells(rowNum, colNum).Value = tcel.innerText
        colNum = colNum + 1
    Next tcel
    rowNum = rowNum + 1
Next trow
End Sub

'Set HTMLTable = HTMLDoc.getElementsByTagName("body")
'Set HTMLTableRows = HTMLdoc.getElementsByTagName("tr")
'Set HTMLTableCells = HTMLdoc.getElementsByTagName("td")
'For Each HTMLTableCells In HTMLTableRows
'Debug.Print HTMLTableRows.innerText
'Next HTMLTableCells
'    ColNum = 1
'    For Each TableCell In TableRows
'        Cells(RowNum, ColNum).Value = TableCell.innerText
'        ColNum = ColNum + 1
'    Next TableCell
'RowNum = RowNum + 1
'Next TableRows

I know that's a lot of someone else's code to look through, but I do try to write my code with the view that someone else may have to edit it. Also, I apologise for not following normal convention, but it grates on me when a variable starts with a lowercase letter and then switches to an upper-case letter halfway through; it just doesn't look elegant, sorry :-)

I'm beginning to suspect that the problem is with the way the table is constructed; is that possible?

I want to say thank you to all of you who have tried to solve my problem, but I am still stuck with the same result. Using the above code I get through to this table: [image: HTMLTable]. Then I get this error: [image: HTMLError].

As you will see from the commented-out code I have tried several different coding options, but I just keep getting an error.

Upvotes: 1

Views: 307

Answers (1)

Michał Duraj

Reputation: 63

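I have written a function that reads any HTML table; try it. The HTMLTab argument must, of course, be an HTMLTable / IHTMLTable object. :)

Function ReadTable(ByVal HTMLTab As Object) As Variant
    ' Returns the cell texts of an HTML table as a 0-based 2-D Variant array.
    ' HTMLTab must be a table element (MSHTML.HTMLTable / IHTMLTable).
    Dim myTable() As Variant
    Dim myRow As Object, myCell As Object
    Dim rLen As Long, cLen As Long
    Dim i As Long, j As Long

    rLen = HTMLTab.Rows.Length
    If rLen = 0 Then Exit Function      ' empty table: result stays Empty

    ' Assumes every row has the same number of cells (no colspan/rowspan).
    cLen = HTMLTab.Cells.Length \ rLen
    ReDim myTable(0 To rLen - 1, 0 To cLen - 1)

    For Each myRow In HTMLTab.Rows
        j = 0
        For Each myCell In myRow.Cells
            myTable(i, j) = myCell.outerText
            j = j + 1
        Next myCell
        i = i + 1
    Next myRow

    ReadTable = myTable

End Function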
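
For completeness, here is a minimal sketch of how ReadTable might be called from the question's setup. It is not tested against the actual site and rests on several assumptions: the public IE object from the question is already showing the page with the table, the table of interest is the first <table> on that page, and ReadTable lives in the same VBA project. The procedure name WriteFirstTableToSheet is made up for illustration. Note that it reads the document from the live browser; the last TableCollection in the question declares a local `Dim HTMLdoc As New HTMLDocument`, which most likely hides the public HTMLDoc and leaves the loop reading an empty document.

```vba
' Sketch only: assumes the public IE object from the question and the
' ReadTable function from this answer are available in the same project.
Public Sub WriteFirstTableToSheet()
    Dim doc As Object           ' late-bound HTML document
    Dim tables As Object        ' collection of <table> elements
    Dim tbl As Object
    Dim data As Variant
    Dim ws As Worksheet

    ' Wait for the page that holds the table to finish loading
    ' (4 = READYSTATE_COMPLETE).
    Do While IE.Busy Or IE.ReadyState <> 4
        DoEvents
    Loop

    ' Take the document from the live browser window.
    Set doc = IE.Document
    Set tables = doc.getElementsByTagName("table")
    If tables.Length = 0 Then
        MsgBox "No table found on the current page."
        Exit Sub
    End If

    Set tbl = tables(0)         ' assumption: first table is the one wanted
    data = ReadTable(tbl)
    If IsEmpty(data) Then Exit Sub      ' nothing to write

    Set ws = Worksheets.Add
    ' The array is 0-based, so each dimension holds UBound + 1 items.
    ws.Range("A1").Resize(UBound(data, 1) + 1, UBound(data, 2) + 1).Value = data
End Sub
```

Writing the whole array to the sheet in one assignment avoids a cell-by-cell loop. If the page holds more than one table, the index in tables(0) would need adjusting.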

Upvotes: 1
