Darren LSH
Darren LSH

Reputation: 35

Extract data from HTML Element - VBA

I'm new to web scraping and the HTML language.

I'm trying to write a code in VBA to extract data from the following website: https://companies.govmu.org:4343/MNSOnlineSearch/

I have an Excel sheet with over 5000 company names and their respective "File No" in columns A and B respectively, and I need to input their "Status" (either "Live" or "Defunct") in column C. This will be done after searching for each company by "File No" and then extracting their status to the Excel sheet.

The issue is that I can't seem to get the element containing the data that I need.

I've already written the bit of code which will extract the "File No" from my Excel sheet, paste it on the webpage in the "File No" search box, and run the search. (You can try searching C5113, as an example).

However, on the resulting webpage, I've tried getting the element containing the data that I need, but it does not work.

For example, I tried to MsgBox (MsgBox is my personal way to check whether my variable contains the data I need) the inner HTML of the tag fieldset (fs) with ID "CompanyList" as shown in the code below, but it returns an error.

I've also tried with another variable named div, of data type HTMLDivElement, and then getting the element by ID "companies".

And finally, I've also tried looping through a variable of type IHTMLElementCollection to look for the element that I need, but it still does not show the element that I need (it shows other elements that I don't need).

Option Explicit

Sub ExtractStatusDetails()
    Dim ie As InternetExplorer
    Dim html As HTMLDocument
    Dim resultHtml As HTMLDocument
    Dim fs As IHTMLElement
    Dim searchBoxes As IHTMLElementCollection
    Dim searchButton As Object
    Dim homePage As String

    homePage = "https://companies.govmu.org:4343/MNSOnlineSearch/"
    Set ie = New InternetExplorer   
    ie.Visible = False
    ie.navigate homePage
    Do While ie.readyState <> READYSTATE_COMPLETE
        DoEvents
    Loop

    Set html = ie.document
    Set searchBoxes = html.getElementsByClassName("col-md-6 col-lg-4")

    searchBoxes(0).innerHTML = Replace(searchBoxes(0).innerHTML, "placeholder", "value")
    searchBoxes(0).innerHTML = Replace(searchBoxes(0).innerHTML, "Search company by File No...", "C63")

    Set searchButton = searchBoxes(0).getElementsByClassName("btn btn-large btn-primary btn-raised")
    searchButton(0).Click
    Do While ie.readyState <> READYSTATE_COMPLETE
        DoEvents
    Loop

    Set resultHtml = ie.document
    Set fs = resultHtml.getElementById("CompanyList")

    MsgBox fs.innerHTML

    ie.Quit

End Sub

Upvotes: 2

Views: 2194

Answers (1)

QHarr
QHarr

Reputation: 84465

The page does an xmlhttp POST request which retrieves data from a backend data store (likely Oracle GlassFish > JDBC API > data repository e.g. MySQL) . It returns all similar matches, possibly including exact.

You can find the POST request in the network traffic of browser dev tools after you enter the fileNo and press the search button.

Below is a function you can call in a loop over your fileNos to retrieve the company status

Option Explicit

Public Sub test()
    Dim fileNo As String, xmlhttp As Object
    fileNo = "C5113"
    Set xmlhttp = CreateObject("MSXML2.XMLHTTP")
    MsgBox GetCompanyStatus(fileNo, xmlhttp)
End Sub

Public Function GetCompanyStatus(ByVal fileNo As String, ByVal xmlhttp As Object) As String
    Dim html As HTMLDocument, body As String, fileNos As Object, i As Long

    Set html = New HTMLDocument
    body = "tabs=tab-1&searchByName=&searchByFileNo=PLACEHOLDER&submitCompanies=&searchByBusName=&searchByBRN=&searchByIncDateFrom=&searchByIncDateTo=&doAction=search"

    With xmlhttp
        .Open "POST", "https://companies.govmu.org:4343/MNSOnlineSearch/GetCompanies", False
        .setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
        .send Replace$(body, "PLACEHOLDER", fileNo)
        html.body.innerHTML = .responseText
        Set fileNos = html.querySelectorAll("td.tdFileNo")
        If fileNos.Length > 0 Then
            For i = 0 To fileNos.Length - 1
                If fileNos.item(i).innerText = fileNo Then
                    GetCompanyStatus = html.querySelectorAll("td.tdStatus").item(i).innerText
                    Exit Function
                End If
            Next i
        End If
        GetCompanyStatus = "Not found"
    End With
End Function

I would instead consider how you can group your requests. As you can post partial file numbers you could cut down on the number of requests considerably by doing in batches with partial file numbers e.g. search for C5 or C51; the backend then does something like "C5%" to return all matches starting with the specified string, and then loop those results searching for your file numbers of interest that fall within that range.

You could have a dictionary with fileNo as key and status as value and update this as you loop the results returned by a request. I think the number of keys is constrained by Long, so no problem I think for storing all your fileNos at the start, in a dictionary, and updating later during requests. You could even have multiple dictionaries that host ranges of fileNos, like the volumes of the good old Encyclopædia Britannica. This would limit loops to dicts you hope to populate from the same request, for example. Is something to play around with an explore.

Upvotes: 1

Related Questions