Reputation: 131
My system is working with Windows 10 64bit and Office 2016
I am a bit familiar with Excel VBA, and I am trying to learn web scraping using VBA. Unfortunately, there is little information on in-depth digging on web data.
The website I want to scrape data from is bizbuysell.com from the seller's offers such as
There is a section that starts with the headline Detailed Information The HTML code is:
<h3>Detailed Information</h3>
I want to scrape data from this section.
The problem is that there are some 18 data labels and their respective values possible, but only those are shown for which the seller has entered data.
My idea was to search for all possible data labels and if they are not available then next data field
I tried it with the following code, but Obviously I made a mistake
For Each ele In doc.getElementsByClassName("listingProfile_details")
txt = ele.parentElement.innerText
If Left(txt, 8) = "Location" Then
location = Trim(Mid(txt, InStrRev(txt, ":") + 1))
ElseIf Left(txt, 4) = "Inventory" Then
inventory = Trim(Mid(txt, InStrRev(txt, ":") + 1))
.
.
.
End If
Next ele
I hope that someone can show me the correct VBA code to check for all 18 possible data labels and the respective data
Thank you so much! Tony
Upvotes: 2
Views: 298
Reputation: 84465
One way it to gather a nodeList of the dt/dd elements and loop it with a step 2 so you can access the label at n indices and the value at n + 1.
To handle differing numbers of labels being present, you can initialise a fresh dictionary, with all the possible labels as keys, and the associated values as vbNullString
, during the loop over urls, such that for each new XHR request you get a new dictionary ready to populate with the labels that are found. By using .Exists test, you only update the values for keys (labels) that are found at the current URI.
You can store all results in an array to write out to the sheet in one go at end.
There are lots of additional notes within the code.
Option Explicit
Public Sub GetDetailedBizBuySellInfo()
Dim http As Object, urls() As Variant
Dim html As MSHTML.HTMLDocument 'VBE > Tools > References > Microsoft HTML Object Library
urls = Array("https://www.bizbuysell.com/Business-Opportunity/covid-friendly-commercial-cleaning-est-30-years-100k-net/1753433/?d=L2Zsb3JpZGEvaGlsbHNib3JvdWdoLWNvdW50eS1idXNpbmVzc2VzLWZvci1zYWxlLzI/cT1hVEk5T0RFc01qQXNNekFzTnpnbWJtRndQV1UlM0Q=", _
"https://www.bizbuysell.com/Business-Opportunity/Established-Cleaning-Business-Tampa-St-Pete/1849521/?utm_source=bizbuysell&utm_medium=emailsite&utm_campaign=shtmlbot&utm_content=headline")
Set http = CreateObject("MSXML2.XMLHTTP")
Set html = New MSHTML.HTMLDocument
Dim url As Long, results() As Variant
ReDim results(1 To UBound(urls) + 1, 1 To 19) 'size the final output array. _
There will be the number of urls as row count, the number of labels as column count + 1 to store the url itself. You need to update the list of labels below. See GetBlankDetailedInformationDictionary
With http
For url = LBound(urls) To UBound(urls) 'loop url list
.Open "Get", urls(url), False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
html.body.innerHTML = .responseText
Dim currentDetailedInformation As Scripting.Dictionary 'VBE > Tools > References > Microsoft Scripting Runtime
Set currentDetailedInformation = GetCurrentDetailedInfo(html) 'use retrieved html to return a dictionary with key as dt > strong e.g.Location; value as dd e.g. Tampa, FL
AddCurrentDetailedInfoToResults results, currentDetailedInformation, url, urls(url) 'url + 1 (zero indexed) will keep track of current row number to add to results
Next
End With
With ActiveSheet 'better to update with explicit sheet/be careful not to overwrite data already in a sheet
.Cells(1, 1).Resize(1, UBound(results, 2)) = currentDetailedInformation.keys ' write out headers
.Cells(1, UBound(results, 2)) = "Url"
.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results ' write out results
End With
End Sub
Public Sub AddCurrentDetailedInfoToResults(ByRef results As Variant, ByVal currentDetailedInformation As Scripting.Dictionary, ByVal url As Long, ByVal currentUrl As String)
Dim key As Variant, currentColumn As Long
For Each key In currentDetailedInformation.keys
currentColumn = currentColumn + 1 'increase column count to update results array with
results(url + 1, currentColumn) = currentDetailedInformation(key)
Next
results(url + 1, currentColumn + 1) = currentUrl
End Sub
Public Function GetCurrentDetailedInfo(ByVal html As MSHTML.HTMLDocument) As Scripting.Dictionary
' Gathers a list of all the relevant dd, dt nodes within the passed in HTMLDocument.
' Requests a new blank dictionary whose keys are the labels (child strong element of dt tag)
'Updates blank dictionary, per key, where present, with dd value in a loop of step 2 as list is strong, dd, strong, dd etc.....
Dim updatedDictionary As Scripting.Dictionary, listOfLabelsAndValues As MSHTML.IHTMLDOMChildrenCollection
Set updatedDictionary = GetBlankDetailedInformationDictionary
'Css pattern to match the appropriate nodes
Set listOfLabelsAndValues = html.querySelectorAll("#ctl00_ctl00_Content_ContentPlaceHolder1_wideProfile_listingDetails_dlDetailedInformation dt > strong, #ctl00_ctl00_Content_ContentPlaceHolder1_wideProfile_listingDetails_dlDetailedInformation dd")
Dim currentIndex As Long
For currentIndex = 0 To listOfLabelsAndValues.length - 2 Step 2 'nodeList is 0 index based
'On Error Resume Next 'key (label) may not be present for current html document _
i.e. url so ignore errors when attempting to update blank dictionary via dt > strong matching on key. If label not found then value = vbNullString
Dim key As String, value As String
key = Trim$(listOfLabelsAndValues.Item(currentIndex).innerText)
value = Trim$(listOfLabelsAndValues.Item(currentIndex + 1).innerText) 'as we are looping every 2 indices 0,2,4 ....
If updatedDictionary.Exists(key) Then updatedDictionary(key) = value
'On Error GoTo 0
Next
Set GetCurrentDetailedInfo = updatedDictionary ' return updated dictionary
End Function
Public Function GetBlankDetailedInformationDictionary() As Scripting.Dictionary
Dim blankDictionary As Scripting.Dictionary, keys() As Variant, key As Long
Set blankDictionary = New Scripting.Dictionary
'' TODO Note: you would add in all 18 labels into array below.
keys = Array("Location:", "Type:", "Inventory:", "Real Estate:", "Building SF:", _
"Building Status:", "Lease Expiration:", "Employees:", "Furniture, Fixtures, & Equipment (FF&E):", _
"Facilities:", "Competition:", "Growth & Expansion:", "Financing:", "Support & Training:", _
"Reason for Selling:", "Franchise:", "Home-Based:", "Business Website:")
For key = LBound(keys) To UBound(keys)
blankDictionary(keys(key)) = vbNullString 'add blank entry to dictionary for each label
Next
Set GetBlankDetailedInformationDictionary = blankDictionary
End Function
Upvotes: 1