Reputation: 41
I am struggling with scraping some data from a website. I have stepped through and for some reason the Question.className does not match what I have and what I have checked on the inspector on the web page. When I say match, nothing seems to be assigned. I have just started to look at data scraping and would appreciate any tips. I haven't included all code as the below code is the where the error occurs. Thanks in advance and apologise if this has been asked before but I couldn't find anything after having a search so thought I would post.
The URL is - https://stackoverflow.com
Range("A3").Value = "Question id" 'put heading across the top of row 3
Range("B3").Value = "Votes"
Range("C3").Value = "Views"
Range("D3").Value = "Person"
Dim QuestionList As IHTMLElement
Dim Questions As IHTMLElementCollection
Dim Question As IHTMLElement
Dim RowNumber As Long
Dim QuestionId As String
Dim QuestionFields As IHTMLElementCollection
Dim QuestionField As IHTMLElement
Dim votes As String
Dim views As String
Dim QuestionFieldLinks As IHTMLElementCollection
Set QuestionList = html.getElementById("question-mini-list")
Set Questions = QuestionList.Children
RowNumber = 4
For Each Question In Questions
'if this is the tag containing the question details, process it
If Question.className = "question-summary narrow" Then
'first get and store the question id in first column
QuestionId = Replace(Question.ID, "question-summary-", "")
Cells(RowNumber, 1).Value = CLng(QuestionId)
'get a list of all of the parts of this question, and loop over them
Set QuestionFields = Question.all
For Each QuestionField In QuestionFields
'if this is the question's votes, store it (get rid of any surrounding text)
If QuestionField.className = "votes" Then
votes = Replace(QuestionField.innerText, "votes", "")
votes = Replace(votes, "vote", "")
Cells(RowNumber, 2).Value = Trim(votes)
End If
'likewise for views (getting rid of any text)
If QuestionField.className = "views" Then
views = QuestionField.innerText
views = Replace(views, "views", "")
views = Replace(views, "view", "")
Cells(RowNumber, 3).Value = Trim(views)
End If
'if this is the bit where author's name is ...
If QuestionField.className = "started" Then
'get a list of all elements within, and store the text in the second one
Set QuestionFieldLinks = QuestionField.all
Cells(RowNumber, 4).Value = QuestionFieldLinks(2).innerHTML
End If
Next QuestionField
'go on to next row of worksheet
RowNumber = RowNumber + 1
End If
Next
Upvotes: 0
Views: 227
Reputation: 22440
Try this. It should fetch you the required fields:
Sub GetInformation()
Dim Http As New XMLHTTP60, Html As New HTMLDocument
Dim post As HTMLDivElement, URL$, R&
URL = "https://stackoverflow.com/"
With Http
.Open "GET", URL, False
.send
Html.body.innerHTML = .responseText
End With
For Each post In Html.getElementsByClassName("question-summary")
R = R + 1: Cells(R, 1) = Split(post.getAttribute("id"), "-")(2)
Cells(R, 2) = Split(post.querySelector(".votes span").getAttribute("title"), " ")(0)
Cells(R, 3) = Split(post.querySelector(".views span").getAttribute("title"), " ")(0)
Next post
End Sub
Reference to add to the library:
Microsoft XML, v6.0
Microsoft HTML Object Library
Upvotes: 2