
Reputation: 5100

Use getElementById on HTMLElement instead of HTMLDocument

I've been playing around with scraping data from web pages using VBS/VBA.

If it were JavaScript I'd be away, as it's easy, but it doesn't seem to be quite as straightforward in VBS/VBA.

This is an example I made for an answer. It works, but I had planned on accessing the child nodes using getElementsByTagName and could not figure out how to call it on an element. The HTMLElement object does not seem to have those methods.

Sub Scrape()
Dim Browser As InternetExplorer
Dim Document As HTMLDocument
Dim Elements As IHTMLElementCollection
Dim Element As IHTMLElement

Set Browser = New InternetExplorer

Browser.navigate ""

Do While Browser.Busy Or Browser.readyState <> READYSTATE_COMPLETE
    DoEvents ' yield while the page is still loading
Loop

Set Document = Browser.Document

Set Elements = Document.getElementsByClassName("profile-col1")

For Each Element in Elements
    Debug.Print "[  name] " & Trim(Element.Children(1).Children(0).innerText)
    Debug.Print "[ title] " & Trim(Element.Children(1).Children(1).innerText)
Next Element

Set Document = Nothing
Set Browser = Nothing
End Sub

I have been looking at the HTMLElement.document property to see whether it behaves like a fragment of the document, but it's either difficult to work with or just isn't what I think it is.

Dim Fragment As HTMLDocument
Set Element = Document.getElementById("example") ' This works
Set Fragment = Element.document ' This doesn't

This also seems a long-winded way to do it (although that's usually the way for VBA, IMO). Does anyone know if there is a simpler way to chain functions?

Document.getElementById("target").getElementsByTagName("tr") would be awesome...

Upvotes: 14

Views: 136140

Answers (4)


Reputation: 84465

I would use an XMLHTTP request to retrieve the page content, as it is much faster than driving a browser. It is then easy to use querySelectorAll with a CSS class selector to grab elements by class name, and to access their child elements by tag name and index.

Option Explicit
Public Sub GetInfo()
    Dim sResponse As String, html As HTMLDocument, elements As Object, i As Long

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "", False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With
    Set html = New HTMLDocument
    With html
        .body.innerHTML = sResponse
        Set elements = .querySelectorAll(".profile-col1")
        For i = 0 To elements.Length - 1
            Debug.Print String(20, Chr$(61))
            Debug.Print elements.item(i).getElementsByTagName("a")(0).innerText
            Debug.Print elements.item(i).getElementsByTagName("p")(0).innerText
            Debug.Print elements.item(i).getElementsByTagName("p")(1).innerText
        Next i
    End With
End Sub


Required reference: VBE > Tools > References > Microsoft HTML Object Library

Upvotes: 2


Reputation: 160

Thanks to dee for the answer with the Scrape() subroutine. The code worked perfectly as written, and I was then able to adapt it to the specific website I am trying to scrape.

I do not have enough reputation to upvote or to comment, but I do actually have some minor improvements to add to dee's answer:

  1. You will need to add the VBA reference "Microsoft HTML Object Library" via Tools > References in order for the code to compile.

  2. I commented out the Browser.Visible line and added a comment, as follows:

    'if you need to debug the browser page, uncomment this line:
    'Browser.Visible = True

  3. And I added a line to close the browser before Set Browser = Nothing:

    Browser.Quit

Thanks again dee!

ETA: this works on machines with IE9, but not machines with IE8. Anyone have a fix?

Found the fix myself, so I came back here to post it. getElementsByClassName is only available from IE9 onward. For this to work in IE8, use querySelectorAll, with a dot preceding the class name of the elements you are looking for:

'Set repList = doc.getElementsByClassName("reportList") 'only works in IE9, not in IE8
Set repList = doc.querySelectorAll(".reportList")       'this works in IE8+
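
A minimal sketch of how the querySelectorAll result might then be consumed, assuming a reference to the Microsoft HTML Object Library. The inline markup and the PrintReportList name are made up for illustration. The result is walked by index because querySelectorAll does not return the same collection type as getElementsByClassName.

```vba
Option Explicit
' Assumes a reference to: Microsoft HTML Object Library

Public Sub PrintReportList()
    Dim doc As HTMLDocument
    Dim repList As Object   ' late-bound: the list returned by querySelectorAll
    Dim i As Long

    ' Illustrative markup only; in practice doc would come from the loaded page
    Set doc = New HTMLDocument
    doc.body.innerHTML = "<ul><li class=""reportList"">Q1</li>" & _
                         "<li class=""reportList"">Q2</li></ul>"

    Set repList = doc.querySelectorAll(".reportList")

    ' Walk the matches by index rather than with For Each
    For i = 0 To repList.Length - 1
        Debug.Print repList.item(i).innerText
    Next i
End Sub
```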

Upvotes: 1


Reputation: 14053

Sub Scrape()
    Dim Browser As InternetExplorer
    Dim Document As htmlDocument
    Dim Elements As IHTMLElementCollection
    Dim Element As IHTMLElement

    Set Browser = New InternetExplorer
    Browser.Visible = True
    Browser.navigate ""

    Do While Browser.Busy Or Browser.readyState <> READYSTATE_COMPLETE
        DoEvents ' yield while the page is still loading
    Loop

    Set Document = Browser.Document

    Set Elements = Document.getElementById("hmenus").getElementsByTagName("li")
    For Each Element In Elements
        Debug.Print Element.innerText
        'Ask Question
    Next Element

    Set Document = Nothing
    Set Browser = Nothing
End Sub
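
If the chained call does not compile or resolve in a given setup, the same lookup can be split over an intermediate variable typed as IHTMLElement2, which is the MSHTML interface that declares getElementsByTagName. This is only a sketch, assuming a reference to the Microsoft HTML Object Library; the inline markup, the "target" id, and the ListTableRows name are illustrative.

```vba
Option Explicit
' Assumes a reference to: Microsoft HTML Object Library

Public Sub ListTableRows()
    Dim doc As HTMLDocument
    Dim container As IHTMLElement2       ' interface that exposes getElementsByTagName
    Dim trs As IHTMLElementCollection
    Dim tr As IHTMLElement

    ' Illustrative markup only; in practice doc would be Browser.Document
    Set doc = New HTMLDocument
    doc.body.innerHTML = "<table id=""target"">" & _
                         "<tr><td>first</td></tr>" & _
                         "<tr><td>second</td></tr></table>"

    ' Assigning the element to an IHTMLElement2 variable switches interfaces
    Set container = doc.getElementById("target")
    If container Is Nothing Then Exit Sub

    Set trs = container.getElementsByTagName("tr")
    For Each tr In trs
        Debug.Print tr.innerText
    Next tr
End Sub
```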

Upvotes: 13


Reputation: 2718

I don't like it either.

So use javascript:

Public Function GetJavaScriptResult(doc As HTMLDocument, jsString As String) As String

    Dim el As HTMLInputElement ' typed as an input element so that .Value is available
    Dim nd As IHTMLDOMNode

    ' Create a hidden INPUT whose ID is not already used in the page
    Set el = doc.createElement("INPUT")
    Do
        el.ID = GenerateRandomAlphaString(100)
    Loop Until doc.getElementById(el.ID) Is Nothing
    el.Style.display = "none"
    Set nd = doc.body.appendChild(el)

    ' Have the page's script engine evaluate jsString and store the result in the INPUT
    doc.parentWindow.execScript "document.getElementById('" & el.ID & "').value = " & jsString

    GetJavaScriptResult = el.Value

    doc.body.removeChild nd

End Function

Function GenerateRandomAlphaString(Length As Long) As String

    Dim i As Long
    Dim Result As String

    Randomize Timer

    For i = 1 To Length
        Result = Result & Chr(Int(Rnd(Timer) * 26 + 65 + Round(Rnd(Timer)) * 32))
    Next i

    GenerateRandomAlphaString = Result

End Function

Let me know if you have any problems with this; I've changed the context from a method to a function.

By the way, what version of IE are you using? I suspect you're on < IE8. If you upgrade to IE8 I presume it'll update shdocvw.dll to ieframe.dll and you will be able to use document.querySelector/All.


Comment response which isn't really a comment: Basically the way to do this in VBA is to traverse the child nodes. The problem is you don't get the correct return types. You could fix this by making your own classes that (separately) implement IHTMLElement and IHTMLElementCollection; but that's WAY too much of a pain for me to do it without getting paid :). If you're determined, go and read up on the Implements keyword for VB6/VBA.

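Public Function getSubElementsByTagName(el As IHTMLElement, tagname As String) As Collection

    Dim descendants As New Collection
    Dim results As New Collection
    Dim i As Long

    getDescendants el, descendants

    ' Item 1 is el itself, so start at 2 to return descendants only
    For i = 2 To descendants.Count
        ' Compare case-insensitively, since tag names are reported in upper case for HTML
        If StrComp(descendants(i).tagname, tagname, vbTextCompare) = 0 Then
            results.Add descendants(i)
        End If
    Next i

    Set getSubElementsByTagName = results ' object assignment needs Set

End Function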

Public Sub getDescendants(nd As IHTMLElement, ByRef descendants As Collection)
    Dim i As Long
    descendants.Add nd
    ' The children collection is zero-based
    For i = 0 To nd.Children.Length - 1
        getDescendants nd.Children.Item(i), descendants
    Next i
End Sub

Upvotes: 5
