Reputation: 57
I am trying to scrape Project Gutenberg.
I can use .getElementsByClassName("chapter") to get the divs that hold the chapters. However, I cannot get all the elements inside those divs as a collection that I can then iterate over.
Sub getZ()
    Dim H As Object, C As New DataObject, stryn&, cptr%, html As New HTMLDocument, p As HTMLHtmlElement, para As Object, i&
    Set H = CreateObject("WinHTTP.WinHTTPRequest.5.1")
    Application.ScreenUpdating = False
    With H
        .SetAutoLogonPolicy 0
        .SetTimeouts 0, 0, 0, 0
        .Open "GET", "https://www.gutenberg.org/files/8164/8164-h/8164-h.htm", False
        .Send
        .WaitForResponse
    End With
    html.body.innerHTML = H.ResponseText
    Set para = html.getElementsByClassName("chapter").getElementsByTagName("*")
    i = 1
    For Each p In para
        Worksheets("Output").Range("A" & i & "") = p.innerText
        i = i + 1
    Next
    Application.ScreenUpdating = True
End Sub
I am getting an error with getElementsByTagName("*") as the object doesn't support that method.
Upvotes: 2
Views: 1654
Reputation: 84465
A cleaner and faster approach is to combine both requirements (all children of elements with the class) into a single CSS selector and then loop over the returned nodeList by index. Note that "> *" matches direct children only; use ".chapter *" if you want every descendant. E.g.
With html.querySelectorAll(".chapter > *")
    For i = 0 To .Length - 1
        Worksheets("Output").Range("A" & i + 1) = .Item(i).innerText
    Next
End With
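For context, here is a minimal sketch of how that snippet might sit inside a complete procedure. It assumes a reference to the Microsoft HTML Object Library (needed for the early-bound HTMLDocument used in the question) and a worksheet named "Output", as in the question. The procedure name WriteChapterChildren and the variable names are only illustrative.

```vba
Option Explicit

' Sketch only. Assumes Tools > References > "Microsoft HTML Object Library"
' is ticked and that a worksheet named "Output" exists (as in the question).
Public Sub WriteChapterChildren()
    Dim http As Object
    Dim html As New HTMLDocument
    Dim nodes As Object
    Dim i As Long

    ' Fetch the page synchronously (third argument False = not async)
    Set http = CreateObject("WinHTTP.WinHTTPRequest.5.1")
    http.Open "GET", "https://www.gutenberg.org/files/8164/8164-h/8164-h.htm", False
    http.Send

    html.body.innerHTML = http.responseText

    ' ".chapter > *" matches only the direct children of each element
    ' that has the class "chapter"
    Set nodes = html.querySelectorAll(".chapter > *")

    Application.ScreenUpdating = False
    ' Loop by index rather than For Each over the nodeList
    For i = 0 To nodes.Length - 1
        Worksheets("Output").Cells(i + 1, 1).Value = nodes.Item(i).innerText
    Next i
    Application.ScreenUpdating = True
End Sub
```

The nodeList index is zero-based while worksheet rows start at 1, hence the i + 1 when writing to column A.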
Upvotes: 2
Reputation: 8741
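Your code fails because html.getElementsByClassName("chapter") returns a collection (an Object/DispHTMLElementCollection, similar to an array), and the collection itself has no getElementsByTagName() method. Each individual element in it (an Object/HTMLDivElement) does have that method, so loop over the collection and call it on each element. This will work: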
Option Explicit
Sub getZ()
    Dim H As Object, C As New DataObject, stryn&, cptr%, html As New HTMLDocument, p As HTMLHtmlElement, para As Object, i&
    Dim objChapters As Object, objChapter1 As Object
    Set H = CreateObject("WinHTTP.WinHTTPRequest.5.1")
    Application.ScreenUpdating = False
    With H
        .SetAutoLogonPolicy 0
        .SetTimeouts 0, 0, 0, 0
        .Open "GET", "https://www.gutenberg.org/files/8164/8164-h/8164-h.htm", False
        .Send
        .WaitForResponse
    End With
    html.Body.innerHTML = H.responseText
    'Set para = html.getElementsByClassName("chapter").getElementsByTagName("*")
    Set objChapters = html.getElementsByClassName("chapter")
    i = 1
    For Each objChapter1 In objChapters
        Set para = objChapter1.getElementsByTagName("*")
        For Each p In para
            Worksheets("Output").Range("A" & i & "") = p.innerText
            i = i + 1
        Next
    Next
    Application.ScreenUpdating = True
    '
    Set objChapters = Nothing
    Set objChapter1 = Nothing
    Set para = Nothing
    Set p = Nothing
    Set html = Nothing
    Set H = Nothing
End Sub
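This gets all descendant elements (not just the direct children) of every element with the class 'chapter'. Because a parent's innerText includes the text of its nested elements, some text will appear in more than one output row.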
Upvotes: 0