Niklas Andersson
Niklas Andersson

Reputation: 79

VBA Excel Scraping

I am getting started with trying to learn about scraping. I got this page that is behind a login and I remember reading that you should not try to do the (1), (2) or (3) thing after get element by tagname. But that you should rather get something more unique like a Classname or ID. But can someone please tell me why

This the GetTag works and

Dim Companyname As String
Companyname = ie.document.getElementsByTagName("span")(1).innertext

This GetClass do not work

Dim Companyname As String
Companyname = ie.document.getElementsByClassName("account-website-name").innertext

This is the text that I am scraping

<span class="account-website-name" data-journey-name="true">Dwellington Journey</span>

Upvotes: 2

Views: 1866

Answers (2)

QHarr
QHarr

Reputation: 84475

CSS selector:

You can achieve the same thing with a CSS selector of .account-website-name

The "." means className. This will return a collection of matching elements if there are more than one.


CSS query:

CSS query


VBA:

You apply the selector with the .querySelectorAll method of .document. This returns a nodeList which you traverse the .Length of, accessing items by index, starting from 0.

Dim aNodeList As Object, i As Long
Set aNodeList = ie.document.querySelectorAll(".account-website-name")

For i = 0 To aNodeList.Length -1
    Debug.Print aNodeList.Item(i).innerText
    ' Debug.Print aNodeList(i).innerText ''<== sometimes this syntax instead
Next

Upvotes: 1

user4039065
user4039065

Reputation:

getELEMENTbyProperty vs getELEMENTSbyProperty

There are primarily two distinct types of commands to retrieve one or more elements from a web page's .Document; those that return a single object and those that return a collection of objects.

Getting an ELEMENT

When getElementById is used, you are asking for a single object (e.g. MSHTML.IHTMLElement). In this case the properties (e.g. .Value, .innerText, .outerHtml, etc) can be retrieved directly. There isn't supposed to be more than a single unique id property within an HTML body so this function should safely return the only element within the i.e.document that matches.

'typical VBA use of getElementById
Dim CompanyName As String
CompanyName = ie.document.getElementById("CompanyID").innerText

Caveat: I've noticed a growing number of web designers who seem to think that using the same id for multiple elements is oh-key-doh-key as long as the id's are within different parent elements like different <div> elements. AFAIK, this is patently wrong but seems to be a growing practise. Be careful on what is returned when using .getElementById.

Getting ELEMENTS

When using getElementsByTagName, getElementsByClassName, etc. where the word Elements is plural, you are returning a collection (e.g. MSHTML.IHTMLElementCollection) of objects, even if that collection contains only one or even none. If you want to use these to directly access an property of one of the elements within the collection, an ordinal index number must be supplied so that a single element within the collection is referenced. The index number within these collections is zero based (i.e. the first starts at (0)).

'retrieve the text from the third <span> element on a webpage
Dim CompanyName As String
CompanyName = ie.document.getElementsByTagName("span")(2).innerText

'output all <span> classnames to the Immediate window until the right one comes along
'retrieve the text from the first <span> element with a classname of 'account-website-name'
Dim e as long, es as long
es = ie.document.getElementsByTagName("span").Length - 1
For e = 0 To es
    Debug.Print ie.document.getElementsByTagName("span")(e).className
    If ie.document.getElementsByTagName("span")(e).className = "account-website-name" Then
        CompanyName = ie.document.getElementsByTagName("span")(e).innerText
        Exit For
    End If
Next e

'same thing, different method
Dim eSPN as MSHTML.IHTMLElement, ecSPNs as MSHTML.IHTMLElementCollection
ecSPNs = ie.document.getElementsByTagName("span")
For Each eSPN in ecSPNs
    Debug.Print eSPN.className
    If eSPN.className = "account-website-name" Then
        CompanyName = eSPN.innerText
        Exit For
    End If
Next eSPN 
Set eSPN = Nothing: Set ecSPNs = Nothing

To summarize, if your Internet.Explorer method uses Elements (plural) rather than Element (singular), you are returning a collection which must have an index number appended if you wish to treat one of the elements within the collection as a single element.

Upvotes: 3

Related Questions