Reputation: 33
I am trying to speed up some intranet webscraping as well as make it more reliable. I am just learning how to implement XMLHTTP and I need some advice on converting my code from IE based scrapping to XMLHTTP.
I have 2 subs in my module that accomplishes loading up and navigating the intranet site (GetWebTable) and parsing through the data (GetOneTable) to return a table in excel. The subs are as follows:
Sub GetWebTable(sAccountNum As String)
On Error Resume Next
Dim objIE As Object
Dim strBuffer As String
Dim thisCol As Integer
Dim iAcctCount As Integer
Dim iCounter As Integer
Dim iNextCounter As Integer
Dim iAcctCell As Integer
Dim thisColCustInfo As Integer
Dim iErrorCounter As Integer
If InStr(1, sAccountNum, "-") <> 0 Then
sAccountNum = Replace(sAccountNum, "-", "")
End If
If InStr(1, sAccountNum, " ") <> 0 Then
sAccountNum = Replace(sAccountNum, " ", "")
End If
iErrorCounter = 1
TRY_AGAIN:
'Spawn Internet Explorer
Set objIE = GetObject("new:{D5E8041D-920F-45e9-B8FB-XXXXXXX}")
DoEvents
With objIE
.Visible = False
.Navigate "http://intranetsite.aspx"
While .busy = True Or .readystate <> 4: DoEvents: Wend
While .Document.readyState <> "complete": DoEvents: Wend
.Document.getElementById("ctl00_MainContentRegion_tAcct").Value = sAcct
While .busy = True Or .readyState <> 4: DoEvents: Wend
While .Document.readyState <> "complete": DoEvents: Wend
.Document.getElementById("ctl00_MainContentRegion_btnRunReport").Click
While .busy = True Or .readyState <> 4: DoEvents: Wend
While .Document.readyState <> "complete": DoEvents: Wend
End With
thisCol = 53
thisColCustInfo = 53
GetOneTable objIE.Document, 9, thisCol
'Cleanup:
objIE.Quit
Set objIE = Nothing
GetWebTable_Error:
Select Case Err.Number
Case 0
Case Else
Debug.Print Err.Number, Err.Description
iErrorCounter = iErrorCounter + 1
objIE.Quit
Set objIE = Nothing
If iErrorCounter > 4 Then On Error Resume Next
GoTo TRY_AGAIN
'Stop
End Select
End Sub
Sub GetOneTable(varWebPageDoc, varTableNum, varColInsert)
Dim varDocElement As Object ' the elements of the document
Dim varDocTable As Object ' the table required
Dim varDocRow As Object ' the rows of the table
Dim varDocCell As Object ' the cells of the rows.
Dim Rng As Range
Dim iCellCount As Long
Dim iElemCount As Long
Dim iTableCount As Long
Dim iRowCount As Long
Dim iRowCounter As Integer
Dim bTableEndFlag As Boolean
bTableEndFlag = False
For Each varDocElement In varWebPageDoc.all
If varDocElement.nodeName = "TABLE" Then
iElemCount = iElemCount + 1
End If
If iElemCount = varTableNum Then
Set varDocTable = varDocElement
iTableCount = iTableCount + 1
iRowCount = iRowCount + 1
Set Rng = Worksheets("Sheet1").Cells(2, varColInsert)
For Each varDocRow In varDocTable.Rows
For Each varDocCell In varDocRow.Cells
If Left(varDocCell.innerText, 9) = "Total for" Then
bTableEndFlag = True
Exit For
End If
Rng.Value = varDocCell.innerText
Set Rng = Rng.Offset(, 1)
iCellCount = iCellCount + 1
Next varDocCell
iRowCount = iRowCount + 1
Set Rng = Rng.Offset(1, -iCellCount)
iCellCount = 0
Next varDocRow
Exit For
End If
Next varDocElement
Set varDocElement = Nothing
Set varDocTable = Nothing
Set varDocRow = Nothing
Set varDocCell = Nothing
Set Rng = Nothing
End Sub
Any thoughts?
Upvotes: 1
Views: 463
Reputation: 8260
HTML is not XML. XML is strictly enforced is terms of opening and closing tags whilst HTML is famous for <br>
tags without closuing </br>
. You'd be very lucky if the HTML is XML compliant.
Anyway, if you want to use XMLHTTP because of the HTTP request and still keep your IE based web scraping code then see this article http://exceldevelopmentplatform.blogspot.com/2018/01/vba-xmlhttp-request-xhr-does-not-parse.html It shows how to use XMLHTTP before passing response to MSHTML.
You can use MSHTML independently of IE, see this article Use MSHTML to parse local HTML file without using Internet Explorer (Microsoft HTML Object Library). If you read that you will see much of the code that you write against the IE object model is in fact aaginst the MSHTML object model and as such you can decouple and jettison IE. Enjoy!
EDIT1: Don't forget you can ask your company's IT staff
You say it is an intranet site which implies internal to your company, you could ask the programmers who are responsible for that system for a direct API guide.
EDIT2: Folding in feedback about how to mimic a browser...
To mimic the browser you need to figure out the traffic that button clicks generate...
To watch network traffic I recommend you switch to Chrome as your browser. Then, on this web page, right-click mouse button and take "Inspect" menu option, this opens the Chrome Developer Tools. Then, in Developer Tools select the Network tab, then click on a link on this page and you will see the traffic that is generated.
So, if you want to go pure XMLHTTP and leave browsers behind then you won't have buttons available to click but you can observe the network traffic that happens when a button is clicked in a browser and you can then mimic this in code.
So for example, in your comment you ask how do I enter an account number and click the button. I'm guessing that clicking a button will result in a XMLHTTP call of something like http://example.com/dowork/mypage.asp?accountnumber=1233456&otherParams=true
so you see account number would be buried in the query parameters. Once you have that url you can put that in your XMLHTTP request.
One potential problem is that system designers may have chosen to hide account numbers in the body of a HTTP POST because it is sensitive/confidential data. However, Chrome Developer Tools is very good and should still yield that information but may have to poke around.
Upvotes: 1