TheRunner83
TheRunner83

Reputation: 41

Website Scraping using VBA

I am struggling with scraping some data from a website. I have stepped through and for some reason the Question.className does not match what I have and what I have checked on the inspector on the web page. When I say match, nothing seems to be assigned. I have just started to look at data scraping and would appreciate any tips. I haven't included all code as the below code is the where the error occurs. Thanks in advance and apologise if this has been asked before but I couldn't find anything after having a search so thought I would post.

The URL is - https://stackoverflow.com

Range("A3").Value = "Question id" 'put heading across the top of row 3
Range("B3").Value = "Votes"
Range("C3").Value = "Views"
Range("D3").Value = "Person"

Dim QuestionList As IHTMLElement
Dim Questions As IHTMLElementCollection
Dim Question As IHTMLElement
Dim RowNumber As Long
Dim QuestionId As String
Dim QuestionFields As IHTMLElementCollection
Dim QuestionField As IHTMLElement
Dim votes As String
Dim views As String
Dim QuestionFieldLinks As IHTMLElementCollection

Set QuestionList = html.getElementById("question-mini-list")
Set Questions = QuestionList.Children

RowNumber = 4

For Each Question In Questions
'if this is the tag containing the question details, process it
If Question.className = "question-summary narrow" Then
'first get and store the question id in first column
QuestionId = Replace(Question.ID, "question-summary-", "")
Cells(RowNumber, 1).Value = CLng(QuestionId)

'get a list of all of the parts of this question, and loop over them
Set QuestionFields = Question.all

For Each QuestionField In QuestionFields
'if this is the question's votes, store it (get rid of any surrounding text)

    If QuestionField.className = "votes" Then
        votes = Replace(QuestionField.innerText, "votes", "")
        votes = Replace(votes, "vote", "")
        Cells(RowNumber, 2).Value = Trim(votes)
    End If

    'likewise for views (getting rid of any text)
    If QuestionField.className = "views" Then
        views = QuestionField.innerText
        views = Replace(views, "views", "")
        views = Replace(views, "view", "")
        Cells(RowNumber, 3).Value = Trim(views)
    End If

    'if this is the bit where author's name is ...
    If QuestionField.className = "started" Then
    'get a list of all elements within, and store the text in the second one
        Set QuestionFieldLinks = QuestionField.all
        Cells(RowNumber, 4).Value = QuestionFieldLinks(2).innerHTML
    End If
Next QuestionField
'go on to next row of worksheet
RowNumber = RowNumber + 1
End If
Next

Upvotes: 0

Views: 227

Answers (1)

SIM
SIM

Reputation: 22440

Try this. It should fetch you the required fields:

Sub GetInformation()
    Dim Http As New XMLHTTP60, Html As New HTMLDocument
    Dim post As HTMLDivElement, URL$, R&

    URL = "https://stackoverflow.com/"

    With Http
        .Open "GET", URL, False
        .send
        Html.body.innerHTML = .responseText
    End With

    For Each post In Html.getElementsByClassName("question-summary")
        R = R + 1: Cells(R, 1) = Split(post.getAttribute("id"), "-")(2)
        Cells(R, 2) = Split(post.querySelector(".votes span").getAttribute("title"), " ")(0)
        Cells(R, 3) = Split(post.querySelector(".views span").getAttribute("title"), " ")(0)
    Next post
End Sub

Reference to add to the library:

Microsoft XML, v6.0
Microsoft HTML Object Library

Upvotes: 2

Related Questions