Smith O.
Smith O.

Reputation: 217

Exploring the Instr VBA Function In Webscraping

I want to scrape this URL https://www.realtor.com/realestateandhomes-search/06510 using the VBA InStr function and extract all URLs with this substring "06510"

Here's is a sample code I've been trying to make work.

Option Explicit

Sub GetLinks()


    '
    'To use HTMLDocument you need to set a reference to Tools -> References -> Microsoft HTML Object Library
    Dim HTML As New HTMLDocument
    Dim http As Object
    Dim links As Object
    Dim link As HTMLHtmlElement
    Dim counter As Long
    Dim website As Range
    Dim LastRange As Range
    Dim row As Long
    Dim continue As Boolean
    Dim respHead As String
    Dim lRow As Long

    Application.ScreenUpdating = False

    ' The row where website addresses start
    row = 30
    continue = True

    lRow = Cells(Rows.count, 1).End(xlUp).row + 1

    ' XMLHTTP gives errors where ServerXMLHTTP does not
    ' even when using the same URL's
    'Set http = CreateObject("MSXML2.XMLHTTP")

    Set http = CreateObject("MSXML2.ServerXMLHTTP.6.0")

    Do While continue

        ' Could set this to first cell with URL then OFFSET columns to get next web site
        Set website = Range("A" & row)

        Set LastRange = Range("B" & lRow)

        If Len(website.Value) < 1 Then

            continue = False
            Exit Sub

        End If

        If website Is Nothing Then

            continue = False

        End If

        'Debug.Print website

        With http

            On Error Resume Next
            .Open "GET", website.Value, False
            .send

            ' If Err.Num is not 0 then an error occurred accessing the website
            ' This checks for badly formatted URL's. The website can still return an error
            ' which should be checked in .Status

            'Debug.Print Err.Number

            ' Clear the row of any previous results
            Range("B" & row & ":e" & row).Clear

            ' If the website sent a valid response to our request
            If Err.Number = 0 Then


                If .Status = 200 Then

                    HTML.body.innerHTML = http.responseText

                    Set links = HTML.getElementsByTagName("a")

                    For Each link In links

                        If InStr(link.outerHTML, "06510") Then

                            LastRange.Value = link.href

                        End If




                    Next

                End If

                Set website = Nothing

            Else

                    'Debug.Print "Error loading page"
                    LastRange.Value = "Error with website address"

            End If

                    On Error GoTo 0

        End With

        row = row + 1

    Loop

    Application.ScreenUpdating = True

End Sub

After inspecting the page, here's a sample of the kind of URL to extract - https://www.realtor.com/realestateandhomes-detail/239-Bradley-St_New-Haven_CT_06510_M36855-92189. Any help will be appreciated

Using QHarr's code in a simplified way...

     Sub GetLinks()

    Dim url As String, links_count As Integer
    Dim j As Integer, row As Integer
    Dim XMLHTTP As Object, html As Object
    'Dim tr_coll As Object, tr As Object
    'Dim elements As Object
    Dim i As Long, allLinksOfInterest As Object
    'Dim td_coll As Object, td As Object, td_col, objT

    url = "https://www.realtor.com/realestateandhomes-search/06510"

    Set XMLHTTP = CreateObject("MSXML2.XMLHTTP")
    XMLHTTP.Open "GET", url, False
    XMLHTTP.send

    Set html = CreateObject("htmlfile")
    html.body.innerHTML = XMLHTTP.responseText



    Set allLinksOfInterest = html.querySelectorAll("[href*='06510']")

    For i = 0 To allLinksOfInterest.Length - 1
    Debug.Print allLinksOfInterest.Item(i).href
    Next

End Sub

Please check if I'm missing anything. I'm still getting the error "Object doesn't support this property or method"

Upvotes: 0

Views: 331

Answers (2)

QHarr
QHarr

Reputation: 84465

Don't use Instr on entire node outerHTML during a loop of all a tags. There are times when this is required but this shouldn't be one of them (hopefully).

You want to use attribute = value css selector with contains, *, operator. It is specifically for the purpose of matching on substrings in attribute values. This is more efficient.

Set allLinksOfInterest = html.querySelectorAll("[href*='06510']")

So,

Dim i As Long, allLinksOfInterest As Object

Set allLinksOfInterest = html.querySelectorAll("[href*='06510']")
For i = 0 To allLinksOfInterest.Length - 1
    Debug.Print allLinksOfInterest.Item(i).href
Next

Attribute = value with contains operator:

[attr*=value]

Represents elements with an attribute name of attr whose value contains at least one occurrence of value within the string.


VBA:

Produces 26 links currently.All are relative links so need domain added as shown in loop. Some are duplicates so consider adding to a dictionary as keys so as remove duplicates.

Option Explicit

Public Sub GetLinks()
    Dim html As HTMLDocument

    Set html = New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.realtor.com/realestateandhomes-search/06510", False
        .send
        html.body.innerHTML = .responseText
    End With

    Dim i As Long, allLinksOfInterest As Object

    Set allLinksOfInterest = html.querySelectorAll("[href*='06510']")
    For i = 0 To allLinksOfInterest.Length - 1
        Debug.Print Replace$(allLinksOfInterest.item(i).href,"about:","https://www.realtor.com")
    Next
End Sub

Upvotes: 1

Aman
Aman

Reputation: 194

If InStr(link.outerHTML, "06510") Then

In the code above, InStr function was used like boolean function. But it is not boolean, instead it returns integer. So, you should add comparison operator after function. May be like:

If InStr(link.outerHTML, "06510")>0 Then

Upvotes: 1

Related Questions