Reputation: 77
Ok, this is the target webpage(s): http://dnd.arkalseif.info/items/index.html_page=27
Here's my current code:
Sub GetItemsList()
' This macro uses manually entered links to scrap the content of the target page.
' It does not (yet) capture hyperlinks, it only grabs text.
Dim ie As Object
Dim retStr As String
Dim sht As Worksheet
Dim LastRow As Long
Dim rCell As Range
Dim rRng As Range
Dim Count As Long
Dim Status As String
Dim BadCount As Long
Set sht = ThisWorkbook.Worksheets("List")
BadCount = 0
LastRow = sht.Cells(sht.Rows.Count, "A").End(xlUp).Row
Set ie = CreateObject("internetexplorer.application")
Set rRng = sht.Range("b1:b" & LastRow)
Status = "Starting at row "
For Each rCell In rRng.Cells
Count = rCell.Row
Application.StatusBar = BadCount & " dead links so far. " & Status & Count & "of " & LastRow & "."
Wait 1
If rCell = "" Then
With ie
.Navigate rCell.Offset(0, -1).Value
.Visible = False
End With
Do While ie.Busy
DoEvents
Loop
Wait 1
On Error GoTo ErrHandler
' rCell.Value = ie.Document.getElementById("content").innerText
rCell.Value = ie.Document.getElementsByClassName("common").innerText
rCell.WrapText = False
Status = "This row successfully scraped. Moving on to row "
Application.StatusBar = BadCount & " dead links so far. " & Status & Count + 1 & "of " & LastRow & "."
Status = "Previous row succeded. Now at row "
98 Wait 1
End If
Next rCell
If BadCount > 0 Then
Application.StatusBar = "Macro finshed running with " & BadCount & " errors."
Else
Application.StatusBar = "Finished."
End If
Exit Sub
ErrHandler:
rCell.Value = ""
Status = "Previous row failed. Moving on to row "
BadCount = BadCount + 1
Application.StatusBar = "This row is a dead link. " & BadCount & " dead links so far. Moving on to row " & Count + 1 & "of " & LastRow & "."
Resume 98
End Sub
(try to ignore all my StatusBar updates, this code was originally meant for a looooong list of hyperlinks, and I needed (at the time) to know when things buggered up)
Now, the commented out line works, in that it grabs the entire body of text from the div id
Content. But I want to grab the hyperlinks nestled inside the first column of the table which is nested inside the div id
(which is what the following line was for). But it just fails. Excel does nothing, treats it like an error, and proceeds to the next link.
I presume that I need to tell Excel to look for the Table class
inside the Div id
. But I don't know how to do that, and I haven't been able to figure it out.
Thanks everyone.
Upvotes: 0
Views: 230
Reputation: 84465
I would use CSS selectors to target the links and XMLHTTP as a faster retrieval method than launching a browser.
CSS selectors:
The following:
td:first-child [href]
The td:first-child is a :first-child
CSS pseudo-class selector of td
tagged element; " "
is a descendant combinator selector, the []
is an attribute selector. Basically, it selects for the first td
element in each row in this case i.e. the first column, and then to the href
attribute element within.
The :first-child CSS pseudo-class represents the first element among a group of sibling elements.
Sadly VBA implementation doesn't support the :not
selector as the exact elements could also be matched with .common tr + tr td :not([href*='rule'],br)
. Support for pseudo selectors is very limited. In this case using a :nth-child()
CSS pseudo-class selector of td:nth-child(1)
would have retrieved specific items if supported in descendant combination as td:nth-child(1) [href]
. I keep meaning to do a write up on what is and isn't supported in case anyone wants as a reference. It is useful to be aware of even non VBA supported methods in case you then chose to switch to a language that does support.
The selector is applied via the querySelectorAll
method of, in this case, HTMLDocument
. It returns all matches as a nodeList
whose .Length
can be traversed to access individual matched elements by index.
nodeList items:
Option Explicit
Public Sub GetLinks()
Dim sResponse As String, html As HTMLDocument, nodeList As Object, i As Long
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "http://dnd.arkalseif.info/items/index.html_page=27", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
Set html = New HTMLDocument
With html
.body.innerHTML = sResponse
Set nodeList = .querySelectorAll("td:first-child [href]")
For i = 0 To nodeList.Length - 1
Debug.Print Replace$(nodeList.item(i), "about:", "http://dnd.arkalseif.info/items/")
Next
End With
End Sub
References (VBE > Tools > References):
Upvotes: 3