lalachka
lalachka

Reputation: 413

scraping data from website using vba - problem

I'm trying to get the address, facility type and some other data from every facility on this search. I'm able to get the search results and the list of facilities but I cannot figure out how to get the data from the page.

EDIT i've applied the suggestiong in the answer, here's the new code and the OBJECT REQUIRED error is at the DEBUG line I'm trying to click on each link and get the name, address, facility type and whatever other data is on that page

Sub Test()
    Dim ie2 As New InternetExplorer
    'Set ie = New InternetExplorerMedium

    With ie2
        .Visible = True
        .navigate "https://healthapps.state.nj.us/facilities/fsSetSearch.aspx?by=county"

        FacType = "Long-Term Care (Nursing Homes)"
        While .Busy Or .ReadyState < 4: DoEvents: Wend

        With .Document
            .querySelector("#middleContent_cbType_0").Click
            .querySelector("#middleContent_btnGetList").Click
        End With
            
        While .Busy Or .ReadyState < 4: DoEvents: Wend
            

        Pause (2)
        
        Dim list2 As Object, i2  As Long, line1 As String, line2 As String

        Set list2 = .Document.querySelectorAll("[href*='fsFacilityDetails.aspx?item=']")
        
        For i2 = 0 To list2.Length - 1
            list2.Item(i2).Click
            Debug.Print .Document.querySelector(".infotable tr:nth-of-type(3) td + td").innerText
            
            While .Busy Or .ReadyState < 4: DoEvents: Wend

            Pause (2)

            address = Replace(Replace(Replace(line1 & " " & line2, "<span id=" & Chr(34) & "middleContent_lbAddress" & Chr(34) & ">", ""), "<br>", ", "), "</span>", "")

            WriteTable .Document.getElementsByTagName("table")(3), .Document.getElementById("middleContent_Menu1").innerText

            .Navigate2 .Document.URL
            While .Busy Or .ReadyState < 4: DoEvents: Wend
            Set list2 = .Document.querySelectorAll("[href*='fsFacilityDetails.aspx?item=']")

        Next
        .Quit                                    '
    End With

End Sub

I get the OBJECT REQUIRED error at this line

Address = Replace(Replace(Replace(.Document.getElementById("middleContent_lbAddress").outerHTML, "<span id=" & Chr(34) & "middleContent_lbAddress" & Chr(34) & ">", ""), "
", ", "), "", "")

but I'm pretty sure that I'm using the wrong way to get the data anyway. So, even without the error I wouldn't have given me what I need.

Sub Test()
    Dim ie2 As New InternetExplorer
    'Set ie = New InternetExplorerMedium

    With ie2
        .Visible = False
        .navigate "https://healthapps.state.nj.us/facilities/fsSetSearch.aspx?by=county"

        While .Busy Or .ReadyState < 4: DoEvents: Wend

        With .Document
            .querySelector("#middleContent_cbType_0").Click
            .querySelector("#middleContent_btnGetList").Click
        End With
            
        While .Busy Or .ReadyState < 4: DoEvents: Wend
            
        Dim list2 As Object, i2  As Long
        Set list2 = .Document.querySelectorAll("#main_table")
             
        For i2 = 0 To list2.Length - 1
            list2.Item(i2).Click

            While .Busy Or .ReadyState < 4: DoEvents: Wend

            Pause (2)
    
            If .Document.getElementById("middleContent_lbResultTitle") Is Nothing Then
                Pause (5)
            End If

            If .Document.getElementById("middleContent_lbResultTitle").outerHTML Like "*Long-Term Care Facility*" Then
                FacType = "Long-Term Care (Nursing Homes)"
            End If

            Address = Replace(Replace(Replace(.Document.getElementById("middleContent_lbAddress").outerHTML, "<span id=" & Chr(34) & "middleContent_lbAddress" & Chr(34) & ">", ""), "<br>", ", "), "</span>", "")

            WriteTable .Document.getElementsByTagName("table")(3), .Document.getElementById("middleContent_Menu1").innerText


            .Navigate2 .Document.URL
            While .Busy Or .ReadyState < 4: DoEvents: Wend
            Set list2 = .Document.querySelectorAll("#main_table")

        Next
        .Quit                                    '
    End With
End Sub

Upvotes: 0

Views: 329

Answers (1)

QHarr
QHarr

Reputation: 84465

This is a single node Set list2 = .Document.querySelectorAll("#main_table"). Instead, assuming same structure for all results use something like:

Dim i As Long, line1 As String, line2 As String, address As String

Set list2 = .Document.querySelectorAll("[href*='fsFacilityDetails.aspx?item=']")

For i = 0 To list2.Length - 1
    line1 = list2.Item(i).NextSibling.NextSibling.NodeValue
    line2 = list2.Item(i).NextSibling.NextSibling.NextSibling.NodeValue
    address = line1 & " " & line2 'apply string cleaning here
Next

This targets initially the hyperlinks for each result, then moves across the br elements with nextSibling to get the address line 1 and 2. You will need to write some string cleaning on the address variable.

If you decide to click each hyperlink, then on the detailed info page use .document.querySelector(".infotable tr:nth-of-type(3) td + td").innerText to retrieve the full address.

An example of navigating to each page (check urls retrieved are complete and don't require a prefix)

Dim i As Long, address As String, urls(), numLinks As Long

Set list2 = .Document.querySelectorAll("[href*='fsFacilityDetails.aspx?item=']")
numLinks = List.Length - 1
ReDim urls(0 To numLinks)

For i = 0 To numLinks
    urls(i) = list2.Item(i).href
Next

For i = 0 To numLinks
    .navigate2 urls(i)
    While .Busy Or .ReadyState <> 4: DoEvents: Wend
    'time loop maybe goes here
    address = .Document.querySelector(".infotable tr:nth-of-type(3) td + td").innerText
    Debug.Print address
Next

Upvotes: 1

Related Questions