Jeremy
Jeremy

Reputation: 1337

Scraping innerHTML from sites using VBA

I'm trying to declare an array of nodes (which isn't a problem) and then scrape the innerHTML of two child nodes within each element of the array - taking SE as an example (an using the IE object method), assume I was trying to scrape the Title and extract of questions on the home page, there is an array of node (class name: "question-summary").

There's then two children nodes (The tile - class name: "question-hyperlink" and the extract - class name: "excerpt") the code I'm using is a follows:

Sub Scraper()
Dim ie As Object
Dim doc As Object, oQuestionShells As Object, oQuestionTitle As Object, oQuestion As Object, oElement As Object
Dim QuestionShell As String, QuestionTitle As String, Question As String, sURL As String

Set ie = CreateObject("internetexplorer.application")
sURL = "https://stackoverflow.com/questions/tagged/excel-formula"

QuestionShell = "question-summary"
QuestionTitle = "question-hyperlink"
Question = "excerpt"

With ie
    .Visible = False
    .Navigate sURL
End With

Set doc = ie.Document 'Stepping through so doc is getting assigned (READY_STATE = 4)

Set oQuestionShells = doc.getElementsByClassName(QuestionShell)

For Each oElement In oQuestionShells
    Set oQuestionTitle = oElement.getElementByClassName(QuestionTitle) 'Assigning this object causes an "Object doesn't support this property or method"
    Set oQuestion = oElement.getElementByClassName(Question) 'Assigning this object causes an "Object doesn't support this property or method"
    Debug.Print oQuestionTitle.innerHTML
    Debug.Print oQuestion.innerHTML
Next

End Sub

Upvotes: 2

Views: 2287

Answers (1)

Robin Mackenzie
Robin Mackenzie

Reputation: 19299

getElementByClassName is not a method.

You can only use getElementsByClassName (note the plural in the method name) which returns a IHTMLElementCollection.

Using an Object in place of IHTMLElementCollection is fine - but you still have to access a specific element in the collection by giving an index.

Let's assume that for each oElement that there is only one instance of the class question-summary and one instance of the class question-hyperlink. Then you can just use getElementsByClassName and use (0) at the end to pull out the first element of the array that is returned.

So your code correction is:

Set oQuestionTitle = oElement.getElementsByClassName(QuestionTitle)(0)
Set oQuestion = oElement.getElementsByClassName(Question)(0)

Full working code (with a few updates i.e. use Option Explicit and wait for the page to load):

Option Explicit

Sub Scraper()

    Dim ie As Object
    Dim doc As Object, oQuestionShells As Object, oQuestionTitle As Object, oQuestion As Object, oElement As Object
    Dim QuestionShell As String, QuestionTitle As String, Question As String, sURL As String

    Set ie = CreateObject("internetexplorer.application")
    sURL = "https://stackoverflow.com/questions/tagged/excel-formula"

    QuestionShell = "question-summary"
    QuestionTitle = "question-hyperlink"
    Question = "excerpt"

    With ie
        .Visible = True
        .Navigate sURL
        Do
            DoEvents
        Loop While .ReadyState < 4 Or .Busy
    End With

    Set doc = ie.Document

    Set oQuestionShells = doc.getElementsByClassName(QuestionShell)

    For Each oElement In oQuestionShells
        'Debug.Print TypeName(oElement)

        Set oQuestionTitle = oElement.getElementsByClassName(QuestionTitle)(0)
        Set oQuestion = oElement.getElementsByClassName(Question)(0)

        Debug.Print oQuestionTitle.innerHTML
        Debug.Print oQuestion.innerHTML
    Next

    ie.Quit

End Sub

Upvotes: 2

Related Questions