Reputation: 79
I am getting started with trying to learn about scraping. I got this page that is behind a login and I remember reading that you should not try to do the (1), (2) or (3) thing after get element by tagname. But that you should rather get something more unique like a Classname or ID. But can someone please tell me why
This the GetTag works and
Dim Companyname As String
Companyname = ie.document.getElementsByTagName("span")(1).innertext
This GetClass do not work
Dim Companyname As String
Companyname = ie.document.getElementsByClassName("account-website-name").innertext
This is the text that I am scraping
<span class="account-website-name" data-journey-name="true">Dwellington Journey</span>
Upvotes: 2
Views: 1866
Reputation: 84475
CSS selector:
You can achieve the same thing with a CSS selector of .account-website-name
The "." means className. This will return a collection of matching elements if there are more than one.
CSS query:
VBA:
You apply the selector with the .querySelectorAll
method of .document
. This returns a nodeList
which you traverse the .Length
of, accessing items by index, starting from 0.
Dim aNodeList As Object, i As Long
Set aNodeList = ie.document.querySelectorAll(".account-website-name")
For i = 0 To aNodeList.Length -1
Debug.Print aNodeList.Item(i).innerText
' Debug.Print aNodeList(i).innerText ''<== sometimes this syntax instead
Next
Upvotes: 1
Reputation:
getELEMENTbyProperty vs getELEMENTSbyProperty
There are primarily two distinct types of commands to retrieve one or more elements from a web page's .Document
; those that return a single object and those that return a collection of objects.
Getting an ELEMENT
When getElementById
is used, you are asking for a single object (e.g. MSHTML.IHTMLElement
). In this case the properties (e.g. .Value
, .innerText
, .outerHtml
, etc) can be retrieved directly. There isn't supposed to be more than a single unique id property within an HTML body so this function should safely return the only element within the i.e.document
that matches.
'typical VBA use of getElementById
Dim CompanyName As String
CompanyName = ie.document.getElementById("CompanyID").innerText
Caveat: I've noticed a growing number of web designers who seem to think that using the same id
for multiple elements is oh-key-doh-key as long as the id's are within different parent elements like different <div>
elements. AFAIK, this is patently wrong but seems to be a growing practise. Be careful on what is returned when using .getElementById
.
Getting ELEMENTS
When using getElementsByTagName
, getElementsByClassName
, etc. where the word Elements is plural, you are returning a collection (e.g. MSHTML.IHTMLElementCollection
) of objects, even if that collection contains only one or even none. If you want to use these to directly access an property of one of the elements within the collection, an ordinal index number must be supplied so that a single element within the collection is referenced. The index number within these collections is zero based (i.e. the first starts at (0)).
'retrieve the text from the third <span> element on a webpage
Dim CompanyName As String
CompanyName = ie.document.getElementsByTagName("span")(2).innerText
'output all <span> classnames to the Immediate window until the right one comes along
'retrieve the text from the first <span> element with a classname of 'account-website-name'
Dim e as long, es as long
es = ie.document.getElementsByTagName("span").Length - 1
For e = 0 To es
Debug.Print ie.document.getElementsByTagName("span")(e).className
If ie.document.getElementsByTagName("span")(e).className = "account-website-name" Then
CompanyName = ie.document.getElementsByTagName("span")(e).innerText
Exit For
End If
Next e
'same thing, different method
Dim eSPN as MSHTML.IHTMLElement, ecSPNs as MSHTML.IHTMLElementCollection
ecSPNs = ie.document.getElementsByTagName("span")
For Each eSPN in ecSPNs
Debug.Print eSPN.className
If eSPN.className = "account-website-name" Then
CompanyName = eSPN.innerText
Exit For
End If
Next eSPN
Set eSPN = Nothing: Set ecSPNs = Nothing
To summarize, if your Internet.Explorer
method uses Elements (plural) rather than Element (singular), you are returning a collection which must have an index number appended if you wish to treat one of the elements within the collection as a single element.
Upvotes: 3