kiran mamalwad
kiran mamalwad

Reputation: 159

Web Scraping Elements By Class & Tag name

I'm trying to copy data from below mentioned web-site, I need the all range of sizes,Price,Amenities,Specials, Reserve. I frame below code but I'm able to copy element properly. first thing only three elements are coping with duplication also I'm not getting result for Amenities and Reserve. Can anybody please look into this?

Sub text()


Dim ie As New InternetExplorer, ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Unit Data")
With ie
    .Visible = True
    .Navigate2 "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955" 

    While .Busy Or .readyState < 4: DoEvents: Wend

    Sheets("Unit Data").Select


    Dim listings As Object, listing As Object, headers(), results()
    Dim r As Long, list As Object, item As Object
    headers = Array("size", "features", "Specials", "Price", "Reserve")
    Set list = .document.getElementsByClassName("units_table")
    '.unit_size medium, .features, .Specials, .price, .Reserve
    Dim rowCount As Long
    rowCount = .document.querySelectorAll(".tab_container li").Length
    ReDim results(1 To rowCount, 1 To UBound(headers) + 1)
    For Each listing In list
            For Each item In listing.getElementsByClassName("unitinfo even")
            r = r + 1

          results(r, 1) = listing.getElementsByClassName("size secondary-color-text")(0).innerText
          results(r, 2) = listing.getElementsByClassName("amenities")(0).innerText
           results(r, 3) = listing.getElementsByClassName("offer1")(0).innerText
        results(r, 4) = listing.getElementsByClassName("rate_text primary-color-text rate_text--clear")(0).innerText
          results(r, 5) = listing.getElementsByClassName("reserve")(0).innerText





        Next
    Next
    ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
    ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    .Quit
End With

  Worksheets("Unit Data").Range("A:G").Columns.AutoFit
End Sub

Upvotes: 5

Views: 4430

Answers (3)

QHarr
QHarr

Reputation: 84465

tl;dr;

Apologies in advance (to some) for the answer length but I thought I would take this pedagogic moment to detail what is going on.

The overall approach I use is the same as in your code: Find a css selector to isolate rows (despite being in different tabs the small, medium, large are actually still all present on page):

Set listings = html.querySelectorAll(".unitinfo")

The above generates the rows. As before, we dump this into a new HTMLDocument so we can leverage querySelector/querySelectorAll methods.


Rows:

Let's take a look at the first row html we are retrieving. The subsequent sections will take this row as a case study to talk through how info is retrieved:

5x5</TD> <TD class=features>
<DIV id=a5x5-1 class="icon a5x5">
<DIV class=img><IMG src="about:/core/resources/images/units/5x5_icon.png"></DIV>
<DIV class=display>
<P>More Information</P></DIV></DIV>
<SCRIPT type=text/javascript>
                // Refine Search
                //
                $(function() {
                    $("#a5x5-1").tooltip({
                        track: false,
                        delay: 0,
                        showURL: false,
                        left: 5,
                        top: 5,
                        bodyHandler: function () {
                            return "        <div class=\"tooltip\">            <div class=\"tooltop\"></div>            <div class=\"toolmid clearfix\">                <div class=\"toolcontent\">                    <div style=\"text-align:center;width:100%\">                        <img alt=\"5 x 5 storage unit\" src=\"/core/resources/images/units/5x5.png\" />                    </div>                    <div class=\"display\">5 x 5</div>                    <div class=\"description\">Think of it like a standard closet. Approximately 25 square feet, this space is perfect for about a dozen boxes, a desk and chair, and a bicycle.</div>                </div>                <div class=\"clearfix\"></div>            </div>            <div class=\"toolfoot\"></div>            <div class=\"clearfix\"></div>        </div>        "}
                    });
                });
        </SCRIPT>
</TD><TD class=rates>
<DIV class="discount_price secondary-color-text standard_price--left">
<DIV class=price_text>Web Rate: </DIV>
<DIV class="rate_text primary-color-text rate_text--clear">$39.00 </DIV></DIV>
<SCRIPT>
$( document ).ready(function() {
    $('.units_table tr.unitinfo').each(function(index, el) {
        if ($(this).find('.standard_price').length != '' && $(this).find('.discount_price').length != '') {
            $(this).parents('.units_table').addClass('both');
            $(this).addClass('also-both');
            $(this).find('.rates').addClass('rates_two_column');
        }
    });
});
</SCRIPT>
</TD><TD class=amenities>
<DIV title="Temperature Controlled" class="amenity_icon icon_climate"></DIV>
<DIV title="Interior Storage" class="amenity_icon icon_interior"></DIV>
<DIV title="Ground Floor" class="amenity_icon icon_ground_floor"></DIV></TD><TD class=offers>
<DIV class=offer1>Call for Specials </DIV>
<DIV class=offer2></DIV></TD><TD class=reserve><A id=5x5:39:00000000 class="facility_call_to_reserve cta_call primary-color primary-hover" href="about:blank#" rel=nofollow>Call </A></TD>

Each row we are going to be working with will have similar html inside of the html2 variable. If you were in doubt look at the javascript in the function shown above:

$('.units_table tr.unitinfo').each(function(index, el) 

it is using the same selector (but also specifying the parent table class and element type (tr)). Basically, that function is being called for each row in the table.


Size:

For some reason the opening td tag is being dropped (I've seen this with missing parent <table> tags I think) so for size, rather than grab by class, I am looking for the start of the closing tag and extracting the string up to there. I do this by by passing the return value given by Instr (where < was found in string) -1 to the Left$ (typed) function.

enter image description here

results(r, 1) = Left$(html2.body.innerHTML, InStr(html2.body.innerHTML, "<") - 1)

This returns 5x5.


Description:

The description column is populated by the function we saw above (which is applied to each row remember)

This bit - $("#a5x5-1").tooltip - tells it where to target, and then the return statement of the function provides the html which has a div, with class description, containing the text we want. As we are not using a browser, and I am on 64 bit windows, I cannot evaluate this script, but I can use split to extract the string (description) between the "description\"> and the start of the closing associated div tag:

results(r, 2) = Split(Split(html2.querySelector("SCRIPT").innerHTML, """description\"">")(1), "</div>")(0)

This returns:

"Think of it like a standard closet. Approximately 25 square feet, this space is perfect for about a dozen boxes, a desk and chair, and a bicycle."


Rate type and price:

These are straightforward and use class name to target:

results(r, 3) = Replace$(html2.querySelector(".price_text").innerText, ":", vbNullString)
results(r, 4) = Trim$(html2.querySelector(".rate_text").innerText)

Returning (respectively)

Web Rate , £39.00


Amenities:

This is where things are a little trickier.

Let's re-examine the html shown above, for this row, that pertains to amenities:

<TD class=amenities>
<DIV title="Temperature Controlled" class="amenity_icon icon_climate"></DIV>
<DIV title="Interior Storage" class="amenity_icon icon_interior"></DIV>
<DIV title="Ground Floor" class="amenity_icon icon_ground_floor"></DIV></TD>

We can see that parent td has a class of amenities, which has child div elements which have compound class names; the latter of which, in each case, serves as an identifier for amenity type e.g. icon_climate.

When you hover over these, on the page, tooltip info is presented:

enter image description here

We can trace the location of this tooltip in the html of the actual page:

enter image description here

As you hover over different amenities this content updates.

To cut a long story short (he says whilst half way down the page!), this content is being updated from a php file on the server. We can make a request for the file and construct a dictionary which maps the class name of each amenities e.g. amenity_icon icon_climate (which as are compound classes need " " replaced with "." when converting to the appropriate css selector of .amenity_icon.icon_climate) to the associated descriptions. You can explore the php file here.

The php file:

Let's look at just the start of the file so as to dissect the basic unit of what is a repeating pattern:

function LoadTooltips() {
        $(".units_table .amenity_icon.icon_climate").tooltip({
        track: false,
        delay: 0,
        showURL: false,
        left: -126,
        top: -100,
        bodyHandler: function () {
            return "<div class=\"sidebar_tooltip\"><h4>Temperature Controlled</h4><p>Units are heated and/or cooled. See manager for details.</p></div>"
        }
    });

The function responsible for updating the tooltip is LoadTooltips. CSS class selectors are used to target each icon:

$(".units_table .amenity_icon.icon_climate").tooltip

And we have the bodyhandler specifying the return text:

bodyHandler: function () {
            return "<div class=\"sidebar_tooltip\"><h4>Temperature Controlled</h4><p>Units are heated and/or cooled. See manager for details.</p></div>"

We have three bits of useful information that appear in repeating groups. The class name selector for the element, the short description and the long description e.g.

  1. .amenity_icon.icon_climate : we use this to map the php file descriptions to the class name of the amenity icon in our row. CSS selector
  2. Temperature Controlled; inside h4 tag of tooltip function return text. Short description
  3. Units are heated and/or cooled. See manager for details.; inside p tag of tooltip function return text. Long description

I write 2 functions, GetMatches and GetAmenitiesDescriptions, that use regex to extract all of the repeating items, for each icon, and return a dictionary which has the css selector as the key, and the short description : long description as the value.

When I gather all the icons in each row:

Set icons = html2.querySelectorAll(".amenity_icon")
        

I use the dictionary to return the tooltip descriptions based on the class name of the icon

For icon = 0 To icons.Length - 1 'use class name of amenity to look up description
    amenitiesInfo(icon) = amenitiesDescriptions("." & Replace$(icons.item(icon).className, Chr$(32), "."))
Next        

I then join the descriptions with vbNewLine to ensure output is on different lines within output cell.

You can explore the regex here.

The regex uses | (Or) syntax so I return all matched patterns in a single list.

arr = GetMatches(re, s, "(\.amenity_icon\..*)""|<h4>(.*)<\/h4>|<p>(.*)<\/p>")

As I will want different submatches (0,1 or 2 a.k.a css class selector, short desc, long desc) I use a Select Case i mod 3, with counter variable i, to extract appropriate sub-matches.

Example of those matches for the mapping in the php file:

enter image description here


Specials:

We are back to class selectors. Offer2 is not populated so you could remove.

results(r, 6) = html2.querySelector(".offer1").innerText
results(r, 7) = html2.querySelector(".offer2").innerText

returns (respectively):

Call for Specials, empty string


Closing remarks:

So, the above walks you through one row. It is simply rinse and repeat in the loop over all rows. Data is added, for efficiency, to an array, results; which is then written to Sheet1 in one go. There are some minor improvements I can see but this is fast.


VBA:

Option Explicit
Public Sub GetInfo()
    Dim ws As Worksheet, html As HTMLDocument, s As String, amenitiesDescriptions As Object
    Const URL As String = "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955"

    Set ws = ThisWorkbook.Worksheets("Sheet1")
    Set html = New HTMLDocument
    Set amenitiesDescriptions = GetAmenitiesDescriptions
    
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        s = .responseText

        html.body.innerHTML = s

        Dim headers(), results(), listings As Object, amenities As String

        headers = Array("Size", "Description", "RateType", "Price", "Amenities", "Offer1", "Offer2")
        Set listings = html.querySelectorAll(".unitinfo")

        Dim rowCount As Long, numColumns As Long, r As Long, c As Long
        Dim icons As Object, icon As Long, amenitiesInfo(), i As Long, item As Long

        rowCount = listings.Length
        numColumns = UBound(headers) + 1

        ReDim results(1 To rowCount, 1 To numColumns)
        Dim html2 As HTMLDocument
        Set html2 = New HTMLDocument
        For item = 0 To listings.Length - 1
            r = r + 1
            html2.body.innerHTML = listings.item(item).innerHTML
            results(r, 1) = Left$(html2.body.innerHTML, InStr(html2.body.innerHTML, "<") - 1)
            results(r, 2) = Split(Split(html2.querySelector("SCRIPT").innerHTML, """description\"">")(1), "</div>")(0)
            results(r, 3) = Replace$(html2.querySelector(".price_text").innerText, ":", vbNullString)
            results(r, 4) = Trim$(html2.querySelector(".rate_text").innerText)
            
            Set icons = html2.querySelectorAll(".amenity_icon")
            ReDim amenitiesInfo(0 To icons.Length - 1)
            
            For icon = 0 To icons.Length - 1 'use class name of amenity to look up description
                amenitiesInfo(icon) = amenitiesDescriptions("." & Replace$(icons.item(icon).className, Chr$(32), "."))
            Next

            amenities = Join$(amenitiesInfo, vbNewLine) 'place each amenity description on a new line within cell when written out

            results(r, 5) = amenities
            results(r, 6) = html2.querySelector(".offer1").innerText
            results(r, 7) = html2.querySelector(".offer2").innerText
        Next

        ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    End With
End Sub

Public Function GetAmenitiesDescriptions() As Object 'retrieve amenities descriptions from php file on server
    Dim s As String, dict As Object, re As Object, i As Long, arr() 'keys based on classname, short desc, full desc
    ' view regex here: https://regex101.com/r/bII5AL/1
    Set dict = CreateObject("Scripting.Dictionary")
    Set re = CreateObject("vbscript.regexp")
    
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.safeandsecureselfstorage.com/core/resources/js/src/common.tooltip.php", False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        s = .responseText
        
        arr = GetMatches(re, s, "(\.amenity_icon\..*)""|<h4>(.*)<\/h4>|<p>(.*)<\/p>")
        For i = LBound(arr) To UBound(arr) Step 3  'build up lookup dictionary for amenities descriptions
            dict(arr(i)) = arr(i + 1) & ": " & arr(i + 2)
        Next
    End With
    Set GetAmenitiesDescriptions = dict
End Function

Public Function GetMatches(ByVal re As Object, inputString As String, ByVal sPattern As String) As Variant
    Dim matches As Object, iMatch As Object, s As String, arrMatches(), i As Long

    With re
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .Pattern = sPattern
        If .test(inputString) Then
            Set matches = .Execute(inputString)
            ReDim arrMatches(0 To matches.Count - 1)
            For Each iMatch In matches
                Select Case i Mod 3
                Case 0
                    arrMatches(i) = iMatch.SubMatches.item(0)
                Case 1
                    arrMatches(i) = iMatch.SubMatches.item(1)
                Case 2
                    arrMatches(i) = iMatch.SubMatches.item(2)
                End Select
                i = i + 1
            Next iMatch
        Else
            ReDim arrMatches(0)
            arrMatches(0) = vbNullString
        End If
    End With
    GetMatches = arrMatches
End Function

Output:

enter image description here


References (VBE > Tools > References):

  1. Microsoft HTML Object Library

Upvotes: 3

Stavros Jon
Stavros Jon

Reputation: 1697

Here's one way to do it:

Sub test()
Dim req As New WinHttpRequest
Dim doc As New HTMLDocument
Dim targetTable As HTMLTable
Dim tableRow As HTMLTableRow
Dim tableCell As HTMLTableCell
Dim element As HTMLDivElement
Dim sht As Worksheet
Dim amenitiesString As String
Dim i As Long
Dim j As Long
Set sht = ThisWorkbook.Worksheets("Sheet1")
With req
    .Open "GET", "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955", False
    .send
    doc.body.innerHTML = .responseText
End With

Set targetTable = doc.getElementById("units_small_units") 'You can use units_medium_units or units_large_units to get the info from the other tabs
i = 0
For Each tableRow In targetTable.Rows
    i = i + 1
    j = 0
    For Each tableCell In tableRow.Cells
    amenitiesString = ""
    j = j + 1
        If tableCell.className = "amenities" And tableCell.innerText <> "Amenities" Then
            For Each element In tableCell.getElementsByTagName("div")
                amenitiesString = amenitiesString & element.Title & ","
            Next element
            sht.Cells(i, j).Value = amenitiesString
        ElseIf tableCell.className <> "features" Then
            sht.Cells(i, j).Value = tableCell.innerText
        End If
    Next tableCell
Next tableRow

End Sub

I'm using an HTTP request instead of Internet Explorer to get the HTML. Apart from that I think you can get an idea of how to access the elements you want.

Here's a screenshot of the result.

enter image description here

The presentation is a bit primitive but you get the idea :-P

Basically this:

listing.getElementsByClassName("amenities")(0).innerText

will return a blank, because there is no inner text in these elements. The information is produced by a script but it can also be found in the title of the div elements.

References used:

Microsoft HTML Object Library and WinHTTP Services Version 5.1

Upvotes: 1

user960881
user960881

Reputation:

Can you try Jquery get approach like below:

$.get( 'url', function(data) {

// Loop through elements
$(data).find("ul").find("li").each( function(){

    var text = $(this).text();

} )

} );

Upvotes: 0

Related Questions