Graham Dredt
Graham Dredt

Reputation: 33

Need help to convert Internet Explorer based web scraping to XMLHTTP

I am trying to speed up some intranet webscraping as well as make it more reliable. I am just learning how to implement XMLHTTP and I need some advice on converting my code from IE based scrapping to XMLHTTP.

I have 2 subs in my module that accomplishes loading up and navigating the intranet site (GetWebTable) and parsing through the data (GetOneTable) to return a table in excel. The subs are as follows:

Sub GetWebTable(sAccountNum As String)

On Error Resume Next

Dim objIE           As Object
Dim strBuffer       As String
Dim thisCol         As Integer
Dim iAcctCount      As Integer
Dim iCounter        As Integer
Dim iNextCounter    As Integer
Dim iAcctCell       As Integer
Dim thisColCustInfo As Integer
Dim iErrorCounter As Integer

    If InStr(1, sAccountNum, "-") <> 0 Then
        sAccountNum = Replace(sAccountNum, "-", "")
    End If

    If InStr(1, sAccountNum, " ") <> 0 Then
        sAccountNum = Replace(sAccountNum, " ", "")
    End If

    iErrorCounter = 1
    TRY_AGAIN:

    'Spawn Internet Explorer
        Set objIE = GetObject("new:{D5E8041D-920F-45e9-B8FB-XXXXXXX}") 
        DoEvents

        With objIE
            .Visible = False
            .Navigate "http://intranetsite.aspx"

            While .busy = True Or .readystate <> 4: DoEvents: Wend
            While .Document.readyState <> "complete": DoEvents: Wend

            .Document.getElementById("ctl00_MainContentRegion_tAcct").Value = sAcct

            While .busy = True Or .readyState <> 4: DoEvents: Wend
            While .Document.readyState <> "complete": DoEvents: Wend

            .Document.getElementById("ctl00_MainContentRegion_btnRunReport").Click

            While .busy = True Or .readyState <> 4: DoEvents: Wend
            While .Document.readyState <> "complete": DoEvents: Wend

        End With

            thisCol = 53
            thisColCustInfo = 53

        GetOneTable objIE.Document, 9, thisCol

        'Cleanup:
        objIE.Quit
        Set objIE = Nothing

    GetWebTable_Error:
    Select Case Err.Number
        Case 0
        Case Else
        Debug.Print Err.Number, Err.Description
        iErrorCounter = iErrorCounter + 1
        objIE.Quit
        Set objIE = Nothing
        If iErrorCounter > 4 Then On Error Resume Next
        GoTo TRY_AGAIN

            'Stop
    End Select
End Sub



Sub GetOneTable(varWebPageDoc, varTableNum, varColInsert)

Dim varDocElement   As Object ' the elements of the document
Dim varDocTable     As Object ' the table required
Dim varDocRow       As Object ' the rows of the table
Dim varDocCell      As Object ' the cells of the rows.
Dim Rng             As Range
Dim iCellCount      As Long
Dim iElemCount      As Long
Dim iTableCount     As Long
Dim iRowCount       As Long
Dim iRowCounter As Integer
Dim bTableEndFlag As Boolean

bTableEndFlag = False

For Each varDocElement In varWebPageDoc.all
    If varDocElement.nodeName = "TABLE" Then
        iElemCount = iElemCount + 1
    End If

    If iElemCount = varTableNum Then
        Set varDocTable = varDocElement

        iTableCount = iTableCount + 1
        iRowCount = iRowCount + 1
        Set Rng = Worksheets("Sheet1").Cells(2, varColInsert)

        For Each varDocRow In varDocTable.Rows

            For Each varDocCell In varDocRow.Cells
                If Left(varDocCell.innerText, 9) = "Total for" Then
                    bTableEndFlag = True
                    Exit For
                End If
                Rng.Value = varDocCell.innerText
                Set Rng = Rng.Offset(, 1)
                iCellCount = iCellCount + 1
            Next varDocCell

            iRowCount = iRowCount + 1
            Set Rng = Rng.Offset(1, -iCellCount)
            iCellCount = 0

        Next varDocRow

        Exit For

    End If

Next varDocElement

Set varDocElement = Nothing
Set varDocTable = Nothing
Set varDocRow = Nothing
Set varDocCell = Nothing
Set Rng = Nothing

End Sub

Any thoughts?

Upvotes: 1

Views: 463

Answers (1)

S Meaden
S Meaden

Reputation: 8260

HTML is not XML. XML is strictly enforced is terms of opening and closing tags whilst HTML is famous for <br> tags without closuing </br>. You'd be very lucky if the HTML is XML compliant.

Anyway, if you want to use XMLHTTP because of the HTTP request and still keep your IE based web scraping code then see this article http://exceldevelopmentplatform.blogspot.com/2018/01/vba-xmlhttp-request-xhr-does-not-parse.html It shows how to use XMLHTTP before passing response to MSHTML.

You can use MSHTML independently of IE, see this article Use MSHTML to parse local HTML file without using Internet Explorer (Microsoft HTML Object Library). If you read that you will see much of the code that you write against the IE object model is in fact aaginst the MSHTML object model and as such you can decouple and jettison IE. Enjoy!

EDIT1: Don't forget you can ask your company's IT staff

You say it is an intranet site which implies internal to your company, you could ask the programmers who are responsible for that system for a direct API guide.

EDIT2: Folding in feedback about how to mimic a browser...

To mimic the browser you need to figure out the traffic that button clicks generate...

To watch network traffic I recommend you switch to Chrome as your browser. Then, on this web page, right-click mouse button and take "Inspect" menu option, this opens the Chrome Developer Tools. Then, in Developer Tools select the Network tab, then click on a link on this page and you will see the traffic that is generated.

So, if you want to go pure XMLHTTP and leave browsers behind then you won't have buttons available to click but you can observe the network traffic that happens when a button is clicked in a browser and you can then mimic this in code.

So for example, in your comment you ask how do I enter an account number and click the button. I'm guessing that clicking a button will result in a XMLHTTP call of something like http://example.com/dowork/mypage.asp?accountnumber=1233456&otherParams=true so you see account number would be buried in the query parameters. Once you have that url you can put that in your XMLHTTP request.

One potential problem is that system designers may have chosen to hide account numbers in the body of a HTTP POST because it is sensitive/confidential data. However, Chrome Developer Tools is very good and should still yield that information but may have to poke around.

Upvotes: 1

Related Questions