Reputation: 35
I'm new to web scraping and the HTML language.
I'm trying to write a code in VBA to extract data from the following website: https://companies.govmu.org:4343/MNSOnlineSearch/
I have an Excel sheet with over 5000 company names and their respective "File No" in columns A and B respectively, and I need to input their "Status" (either "Live" or "Defunct") in column C. This will be done after searching for each company by "File No" and then extracting their status to the Excel sheet.
The issue is that I can't seem to get the element containing the data that I need.
I've already written the bit of code which will extract the "File No" from my Excel sheet, paste it on the webpage in the "File No" search box, and run the search. (You can try searching C5113, as an example).
However, on the resulting webpage, I've tried getting the element containing the data that I need, but it does not work.
For example, I tried to MsgBox (MsgBox is my personal way to check whether my variable contains the data I need) the inner HTML of the tag fieldset (fs) with ID "CompanyList" as shown in the code below, but it returns an error.
I've also tried with another variable named div, of data type HTMLDivElement, and then getting the element by ID "companies".
And finally, I've also tried looping through a variable of type IHTMLElementCollection to look for the element that I need, but it still does not show the element that I need (it shows other elements that I don't need).
Option Explicit
Sub ExtractStatusDetails()
Dim ie As InternetExplorer
Dim html As HTMLDocument
Dim resultHtml As HTMLDocument
Dim fs As IHTMLElement
Dim searchBoxes As IHTMLElementCollection
Dim searchButton As Object
Dim homePage As String
homePage = "https://companies.govmu.org:4343/MNSOnlineSearch/"
Set ie = New InternetExplorer
ie.Visible = False
ie.navigate homePage
Do While ie.readyState <> READYSTATE_COMPLETE
DoEvents
Loop
Set html = ie.document
Set searchBoxes = html.getElementsByClassName("col-md-6 col-lg-4")
searchBoxes(0).innerHTML = Replace(searchBoxes(0).innerHTML, "placeholder", "value")
searchBoxes(0).innerHTML = Replace(searchBoxes(0).innerHTML, "Search company by File No...", "C63")
Set searchButton = searchBoxes(0).getElementsByClassName("btn btn-large btn-primary btn-raised")
searchButton(0).Click
Do While ie.readyState <> READYSTATE_COMPLETE
DoEvents
Loop
Set resultHtml = ie.document
Set fs = resultHtml.getElementById("CompanyList")
MsgBox fs.innerHTML
ie.Quit
End Sub
Upvotes: 2
Views: 2194
Reputation: 84465
The page does an xmlhttp POST request which retrieves data from a backend data store (likely Oracle GlassFish > JDBC API > data repository e.g. MySQL) . It returns all similar matches, possibly including exact.
You can find the POST request in the network traffic of browser dev tools after you enter the fileNo and press the search button.
Below is a function you can call in a loop over your fileNos to retrieve the company status
Option Explicit
Public Sub test()
Dim fileNo As String, xmlhttp As Object
fileNo = "C5113"
Set xmlhttp = CreateObject("MSXML2.XMLHTTP")
MsgBox GetCompanyStatus(fileNo, xmlhttp)
End Sub
Public Function GetCompanyStatus(ByVal fileNo As String, ByVal xmlhttp As Object) As String
Dim html As HTMLDocument, body As String, fileNos As Object, i As Long
Set html = New HTMLDocument
body = "tabs=tab-1&searchByName=&searchByFileNo=PLACEHOLDER&submitCompanies=&searchByBusName=&searchByBRN=&searchByIncDateFrom=&searchByIncDateTo=&doAction=search"
With xmlhttp
.Open "POST", "https://companies.govmu.org:4343/MNSOnlineSearch/GetCompanies", False
.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
.send Replace$(body, "PLACEHOLDER", fileNo)
html.body.innerHTML = .responseText
Set fileNos = html.querySelectorAll("td.tdFileNo")
If fileNos.Length > 0 Then
For i = 0 To fileNos.Length - 1
If fileNos.item(i).innerText = fileNo Then
GetCompanyStatus = html.querySelectorAll("td.tdStatus").item(i).innerText
Exit Function
End If
Next i
End If
GetCompanyStatus = "Not found"
End With
End Function
I would instead consider how you can group your requests. As you can post partial file numbers you could cut down on the number of requests considerably by doing in batches with partial file numbers e.g. search for C5 or C51; the backend then does something like "C5%" to return all matches starting with the specified string, and then loop those results searching for your file numbers of interest that fall within that range.
You could have a dictionary with fileNo
as key and status
as value and update this as you loop the results returned by a request. I think the number of keys is constrained by Long
, so no problem I think for storing all your fileNos at the start, in a dictionary, and updating later during requests. You could even have multiple dictionaries that host ranges of fileNos, like the volumes of the good old Encyclopædia Britannica. This would limit loops to dicts you hope to populate from the same request, for example. Is something to play around with an explore.
Upvotes: 1