Reputation: 49
Once again I find myself encountering a very specific problem. I'm pretty new to VBA, and especially HTML, so bear with me. I've built a functioning web scraper in VBA, but there are a few specific tasks I want to accomplish that I can't figure out.
Here is the HTML sample that my question refers to.
I've replaced all the stuff that doesn't matter with ellipses. The important part that I want to scrape is in the "a" tags, the "data-shorturl" (or just the innerText
). This is a website name. There are up to five of these, but there are not always five. This is also just one of two section with up to five websites listed. The posted section has the children of <div class="referralsSites referring">
, and the other has the children of <div class="referralsSites destination">
.
Each website in the "referring" section I want to assign to "Up" variables -- the first website assigned to "Up1", the second to "Up2", and so on, but only depending on how many websites are in the "referring" section. I want to do the same in the "destination" section, but assigned to "Down" variables (Down1, Down2, etc.) depending on how many destination sites there are.
If I were to just use getElementsByClassName("websitePage-listItemLink js-tooltipTarget")
, for example, I wouldn't be able to differentiate between the referral and destination sites.
Here is my code so far:
Sub GetSimilarWebData()
Dim appIE As InternetExplorer
Dim HTML As HTMLDocument
Dim ieWindow As SHDocVw.InternetExplorer
Dim URL As String
Dim Rankings As IHTMLElementCollection, Traffic As IHTMLElementCollection, ReferSites As IHTMLElementCollection, DestSites As IHTMLElementCollection, _
rSite As IHTMLElement, rSiteNo As Long, dSite As IHTMLElement, dSiteNo As Long, GlobalRank As String, CountryName As String, CountryRank As String, _
Visits As String, Direct As String, Refer As String, Search As String, Social As String, Display As String, _
Up1 As String, Up2 As String, Up3 As String, Up4 As String, Up5 As String, _
D1 As String, D2 As String, D3 As String, D4 As String, D5 As String
Dim FraudLast As Long
CheckLast = Worksheets("Sheet1").Range("I1").End(xlDown).Offset(1).Row
webStr = Worksheets("Sheet1").Range("A" & CheckLast).Value
Set appIE = New InternetExplorer
appIE.Visible = False
appIE.navigate "https://www.similarweb.com/website/" & webStr
Do While appIE.readyState <> READYSTATE_COMPLETE
Application.StatusBar = "Connecting to SimilarWeb..."
DoEvents
Loop
Set HTML = appIE.document
Set appIE = Nothing
Application.StatusBar = ""
Set Rankings = HTML.getElementsByClassName("rankingItem-value")
GlobalRank = Rankings(0).innerText
If GlobalRank = "N/A" Then
GlobalRank = "null"
CountryName = "null"
CountryRank = "null"
Else
CountryName = HTML.getElementsByClassName("rankingItem-subTitle")(1).innerText
CountryRank = Rankings(1).innerText
End If
Visits = HTML.getElementsByClassName("engagementInfo-value engagementInfo-value--large u-text-ellipsis")(0).innerText
If InStr(Visits, "M") <> 0 Then
Visits = Replace(Visits, ".", "")
Visits = Replace(Visits, "M", "00000")
ElseIf InStr(Visits, "K") <> 0 Then
Visits = Replace(Visits, ".", "")
Visits = Replace(Visits, "K", "00")
ElseIf InStr(Visits, "B") <> 0 Then
Visits = Replace(Visits, ".", "")
Visits = Replace(Visits, "B", "00000000")
End If
Set Traffic = HTML.getElementsByClassName("trafficSourcesChart-value")
Direct = Traffic(0).innerText
Refer = Traffic(1).innerText
Search = Traffic(2).innerText
Social = Traffic(3).innerText
Display = Traffic(4).innerText
'Here's what I've started off with:
Set ReferSite = HTML.getElementsByClassName("referralsSites referring")
rSiteNo = ReferSite.Length
Set DestSite = HTML.getElementsByClassName("referralsSites destination")
dSiteNo = DestSite.Length
'For Each rSite In ReferSite
End Sub
I'm not really sure how to approach the problem. Everything else in my code works fine, but of course if there's anything I can do to improve the speed that would also be welcome.
All of this is referring to data on similarweb.com.
Upvotes: 1
Views: 11555
Reputation: 19289
The getElementsByClassName
method can be used on a IHTMLElement
object as well as HTMLDocument
object. This means you can get the separate lists of referral and destination sites in two 'hops'.
First get the <div>
s with the classname of referralsSites referring
or referralsSites destination
. The getElementsByClassName
method returns a IHTMLElementCollection
which is a collection of IHTMLElement
. So you get the 0th element of the collection (assuming there's only one <div>
) and then get the <a>
s within that <div>
with a class of websitePage-listItemLink
by calling getElementsByClassName
method again on the IHTMLElement
for the <div>
.
Here is an example for stackoverflow.com - I'm just doing Debug.Print
of the output but you might want to assign the site-names into an array, or Collection
or something.
Option Explicit
Sub Test()
'references required:
'Microsoft HTML Object Library
'Microsoft Internet Controls
Dim strUrl As String
Dim objIe As InternetExplorer
Dim objHtml As HTMLDocument
Dim strHtml As String
Dim objDivs As IHTMLElementCollection
Dim objAnchors As IHTMLElementCollection
Dim intCounter As Integer
'set target to scrape
strUrl = "https://www.similarweb.com/website/stackoverflow.com"
'get html from page
Set objIe = New InternetExplorer
objIe.Visible = False
objIe.navigate strUrl
While objIe.readyState <> READYSTATE_COMPLETE
DoEvents
Wend
'assign html to DOM document
Set objHtml = New HTMLDocument
Set objHtml = objIe.document
'get referrals
Set objDivs = objHtml.getElementsByClassName("referralsSites referring")
If objDivs.Length > 0 Then
Set objAnchors = objDivs(0).getElementsByClassName("websitePage-listItemLink")
Debug.Print "Referrers:"
If objAnchors.Length > 0 Then
For intCounter = 0 To objAnchors.Length - 1
Debug.Print objAnchors(intCounter).innerText
Next intCounter
End If
End If
'get destinations
Set objDivs = objHtml.getElementsByClassName("referralsSites destination")
If objDivs.Length > 0 Then
Set objAnchors = objDivs(0).getElementsByClassName("websitePage-listItemLink")
Debug.Print "Destinations:"
If objAnchors.Length > 0 Then
For intCounter = 0 To objAnchors.Length - 1
Debug.Print objAnchors(intCounter).innerText
Next intCounter
End If
End If
'clean up
Set objHtml = Nothing
objIe.Quit
Set objIe = Nothing
End Sub
This gives an output of:
Referrers:
news.ycombinator.com
qwant.com
github.com
remoteok.io
serverfault.com
Destinations:
jsfiddle.net
youtube.com
github.com
i.sstatic.net
w3schools.com
Upvotes: 2