alphaService
alphaService

Reputation: 131

Excel 2016 VBA web scraping using getElementsByClassName

My system is working with Windows 10 64bit and Office 2016

I am a bit familiar with Excel VBA, and I am trying to learn web scraping using VBA. Unfortunately, there is little information on in-depth digging on web data.

The website I want to scrape data from is bizbuysell.com from the seller's offers such as

Sample URL 1 Sample URL 2

There is a section that starts with the headline Detailed Information The HTML code is:

Detailed Information

<h3>Detailed Information</h3>
Location:
Pinellas County, FL
Inventory:
Included in asking price
Employees:
8 FT

I want to scrape data from this section.

The problem is that there are some 18 data labels and their respective values possible, but only those are shown for which the seller has entered data.

My idea was to search for all possible data labels and if they are not available then next data field

I tried it with the following code, but Obviously I made a mistake

For Each ele In doc.getElementsByClassName("listingProfile_details")
txt = ele.parentElement.innerText

If Left(txt, 8) = "Location" Then
    location = Trim(Mid(txt, InStrRev(txt, ":") + 1))
ElseIf Left(txt, 4) = "Inventory" Then
    inventory = Trim(Mid(txt, InStrRev(txt, ":") + 1))
.
.
.
End If

Next ele

I hope that someone can show me the correct VBA code to check for all 18 possible data labels and the respective data

Thank you so much! Tony

Upvotes: 2

Views: 298

Answers (1)

QHarr
QHarr

Reputation: 84465

One way it to gather a nodeList of the dt/dd elements and loop it with a step 2 so you can access the label at n indices and the value at n + 1.

To handle differing numbers of labels being present, you can initialise a fresh dictionary, with all the possible labels as keys, and the associated values as vbNullString, during the loop over urls, such that for each new XHR request you get a new dictionary ready to populate with the labels that are found. By using .Exists test, you only update the values for keys (labels) that are found at the current URI.

You can store all results in an array to write out to the sheet in one go at end.

There are lots of additional notes within the code.


Option Explicit

Public Sub GetDetailedBizBuySellInfo()
    Dim http As Object, urls() As Variant
    Dim html As MSHTML.HTMLDocument              'VBE > Tools > References > Microsoft HTML Object Library
    
    urls = Array("https://www.bizbuysell.com/Business-Opportunity/covid-friendly-commercial-cleaning-est-30-years-100k-net/1753433/?d=L2Zsb3JpZGEvaGlsbHNib3JvdWdoLWNvdW50eS1idXNpbmVzc2VzLWZvci1zYWxlLzI/cT1hVEk5T0RFc01qQXNNekFzTnpnbWJtRndQV1UlM0Q=", _
                 "https://www.bizbuysell.com/Business-Opportunity/Established-Cleaning-Business-Tampa-St-Pete/1849521/?utm_source=bizbuysell&utm_medium=emailsite&utm_campaign=shtmlbot&utm_content=headline")
                 
    Set http = CreateObject("MSXML2.XMLHTTP")
    Set html = New MSHTML.HTMLDocument
    
    Dim url As Long, results() As Variant
             
    ReDim results(1 To UBound(urls) + 1, 1 To 19) 'size the final output array. _
                                                  There will be the number of urls as row count, the number of labels as column count + 1 to store the url itself. You need to update the list of labels below. See GetBlankDetailedInformationDictionary   
    With http
    
        For url = LBound(urls) To UBound(urls)   'loop url list
            
            .Open "Get", urls(url), False
            .setRequestHeader "User-Agent", "Mozilla/5.0"
            .send
            
            html.body.innerHTML = .responseText
            
            Dim currentDetailedInformation As Scripting.Dictionary 'VBE > Tools > References > Microsoft Scripting Runtime
            
            Set currentDetailedInformation = GetCurrentDetailedInfo(html) 'use retrieved html to return a dictionary with key as dt > strong e.g.Location; value as dd e.g. Tampa, FL
   
            AddCurrentDetailedInfoToResults results, currentDetailedInformation, url, urls(url) 'url + 1 (zero indexed) will keep track of current row number to add to results
        Next
        
    End With
    
    With ActiveSheet                             'better to update with explicit sheet/be careful not to overwrite data already in a sheet
        .Cells(1, 1).Resize(1, UBound(results, 2)) = currentDetailedInformation.keys ' write out headers
        .Cells(1, UBound(results, 2)) = "Url"
        .Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results ' write out results
    End With
End Sub

Public Sub AddCurrentDetailedInfoToResults(ByRef results As Variant, ByVal currentDetailedInformation As Scripting.Dictionary, ByVal url As Long, ByVal currentUrl As String)
    
    Dim key As Variant, currentColumn As Long

    For Each key In currentDetailedInformation.keys
        currentColumn = currentColumn + 1        'increase column count to update results array with
        results(url + 1, currentColumn) = currentDetailedInformation(key)
    Next
    results(url + 1, currentColumn + 1) = currentUrl
End Sub

Public Function GetCurrentDetailedInfo(ByVal html As MSHTML.HTMLDocument) As Scripting.Dictionary
    ' Gathers a list of all the relevant dd, dt nodes within the passed in HTMLDocument.
    ' Requests a new blank dictionary whose keys are the labels (child strong element of dt tag)
    'Updates blank dictionary, per key, where present, with dd value in a loop of step 2 as list is strong, dd, strong, dd etc.....
    
    Dim updatedDictionary As Scripting.Dictionary, listOfLabelsAndValues As MSHTML.IHTMLDOMChildrenCollection
    
    Set updatedDictionary = GetBlankDetailedInformationDictionary
    'Css pattern to match the appropriate nodes
    Set listOfLabelsAndValues = html.querySelectorAll("#ctl00_ctl00_Content_ContentPlaceHolder1_wideProfile_listingDetails_dlDetailedInformation dt > strong, #ctl00_ctl00_Content_ContentPlaceHolder1_wideProfile_listingDetails_dlDetailedInformation dd")

    Dim currentIndex As Long
    
    For currentIndex = 0 To listOfLabelsAndValues.length - 2 Step 2 'nodeList is 0 index based
    
        'On Error Resume Next                     'key (label) may not be present for current html document _
        i.e. url so ignore errors when attempting to update blank dictionary via dt > strong matching on key. If label not found then value = vbNullString
        Dim key As String, value As String
        
        key = Trim$(listOfLabelsAndValues.Item(currentIndex).innerText)
        value = Trim$(listOfLabelsAndValues.Item(currentIndex + 1).innerText) 'as we are looping every 2 indices 0,2,4 ....
        If updatedDictionary.Exists(key) Then updatedDictionary(key) = value
        
        'On Error GoTo 0
    Next
    
    Set GetCurrentDetailedInfo = updatedDictionary ' return updated dictionary

End Function

Public Function GetBlankDetailedInformationDictionary() As Scripting.Dictionary

    Dim blankDictionary As Scripting.Dictionary, keys() As Variant, key As Long

    Set blankDictionary = New Scripting.Dictionary

    '' TODO Note: you would add in all 18 labels into array below.
    keys = Array("Location:", "Type:", "Inventory:", "Real Estate:", "Building SF:", _
                 "Building Status:", "Lease Expiration:", "Employees:", "Furniture, Fixtures, & Equipment (FF&E):", _
                 "Facilities:", "Competition:", "Growth & Expansion:", "Financing:", "Support & Training:", _
                 "Reason for Selling:", "Franchise:", "Home-Based:", "Business Website:")

    For key = LBound(keys) To UBound(keys)
        blankDictionary(keys(key)) = vbNullString 'add blank entry to dictionary for each label
    Next
    
    Set GetBlankDetailedInformationDictionary = blankDictionary
End Function

Upvotes: 1

Related Questions