Reputation: 159
I'm trying to copy data from below mentioned web-site, I need the all range of sizes,Price,Amenities,Specials, Reserve. I frame below code but I'm able to copy element properly. first thing only three elements are coping with duplication also I'm not getting result for Amenities and Reserve. Can anybody please look into this?
Sub text()
Dim ie As New InternetExplorer, ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Unit Data")
With ie
.Visible = True
.Navigate2 "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955"
While .Busy Or .readyState < 4: DoEvents: Wend
Sheets("Unit Data").Select
Dim listings As Object, listing As Object, headers(), results()
Dim r As Long, list As Object, item As Object
headers = Array("size", "features", "Specials", "Price", "Reserve")
Set list = .document.getElementsByClassName("units_table")
'.unit_size medium, .features, .Specials, .price, .Reserve
Dim rowCount As Long
rowCount = .document.querySelectorAll(".tab_container li").Length
ReDim results(1 To rowCount, 1 To UBound(headers) + 1)
For Each listing In list
For Each item In listing.getElementsByClassName("unitinfo even")
r = r + 1
results(r, 1) = listing.getElementsByClassName("size secondary-color-text")(0).innerText
results(r, 2) = listing.getElementsByClassName("amenities")(0).innerText
results(r, 3) = listing.getElementsByClassName("offer1")(0).innerText
results(r, 4) = listing.getElementsByClassName("rate_text primary-color-text rate_text--clear")(0).innerText
results(r, 5) = listing.getElementsByClassName("reserve")(0).innerText
Next
Next
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
.Quit
End With
Worksheets("Unit Data").Range("A:G").Columns.AutoFit
End Sub
Upvotes: 5
Views: 4430
Reputation: 84465
tl;dr;
Apologies in advance (to some) for the answer length but I thought I would take this pedagogic moment to detail what is going on.
The overall approach I use is the same as in your code: Find a css selector to isolate rows (despite being in different tabs the small, medium, large are actually still all present on page):
Set listings = html.querySelectorAll(".unitinfo")
The above generates the rows. As before, we dump this into a new HTMLDocument
so we can leverage querySelector/querySelectorAll
methods.
Rows:
Let's take a look at the first row html we are retrieving. The subsequent sections will take this row as a case study to talk through how info is retrieved:
5x5</TD> <TD class=features>
<DIV id=a5x5-1 class="icon a5x5">
<DIV class=img><IMG src="about:/core/resources/images/units/5x5_icon.png"></DIV>
<DIV class=display>
<P>More Information</P></DIV></DIV>
<SCRIPT type=text/javascript>
// Refine Search
//
$(function() {
$("#a5x5-1").tooltip({
track: false,
delay: 0,
showURL: false,
left: 5,
top: 5,
bodyHandler: function () {
return " <div class=\"tooltip\"> <div class=\"tooltop\"></div> <div class=\"toolmid clearfix\"> <div class=\"toolcontent\"> <div style=\"text-align:center;width:100%\"> <img alt=\"5 x 5 storage unit\" src=\"/core/resources/images/units/5x5.png\" /> </div> <div class=\"display\">5 x 5</div> <div class=\"description\">Think of it like a standard closet. Approximately 25 square feet, this space is perfect for about a dozen boxes, a desk and chair, and a bicycle.</div> </div> <div class=\"clearfix\"></div> </div> <div class=\"toolfoot\"></div> <div class=\"clearfix\"></div> </div> "}
});
});
</SCRIPT>
</TD><TD class=rates>
<DIV class="discount_price secondary-color-text standard_price--left">
<DIV class=price_text>Web Rate: </DIV>
<DIV class="rate_text primary-color-text rate_text--clear">$39.00 </DIV></DIV>
<SCRIPT>
$( document ).ready(function() {
$('.units_table tr.unitinfo').each(function(index, el) {
if ($(this).find('.standard_price').length != '' && $(this).find('.discount_price').length != '') {
$(this).parents('.units_table').addClass('both');
$(this).addClass('also-both');
$(this).find('.rates').addClass('rates_two_column');
}
});
});
</SCRIPT>
</TD><TD class=amenities>
<DIV title="Temperature Controlled" class="amenity_icon icon_climate"></DIV>
<DIV title="Interior Storage" class="amenity_icon icon_interior"></DIV>
<DIV title="Ground Floor" class="amenity_icon icon_ground_floor"></DIV></TD><TD class=offers>
<DIV class=offer1>Call for Specials </DIV>
<DIV class=offer2></DIV></TD><TD class=reserve><A id=5x5:39:00000000 class="facility_call_to_reserve cta_call primary-color primary-hover" href="about:blank#" rel=nofollow>Call </A></TD>
Each row we are going to be working with will have similar html inside of the html2
variable. If you were in doubt look at the javascript in the function shown above:
$('.units_table tr.unitinfo').each(function(index, el)
it is using the same selector (but also specifying the parent table class and element type (tr
)). Basically, that function is being called for each row in the table.
Size:
For some reason the opening td
tag is being dropped (I've seen this with missing parent <table>
tags I think) so for size, rather than grab by class, I am looking for the start of the closing tag and extracting the string up to there. I do this by by passing the return value given by Instr (where < was found in string) -1 to the Left$
(typed) function.
results(r, 1) = Left$(html2.body.innerHTML, InStr(html2.body.innerHTML, "<") - 1)
This returns 5x5
.
Description:
The description column is populated by the function we saw above (which is applied to each row remember)
This bit - $("#a5x5-1").tooltip
- tells it where to target, and then the return statement of the function provides the html which has a div
, with class description
, containing the text we want. As we are not using a browser, and I am on 64 bit windows, I cannot evaluate this script, but I can use split
to extract the string (description) between the "description\">
and the start of the closing associated div
tag:
results(r, 2) = Split(Split(html2.querySelector("SCRIPT").innerHTML, """description\"">")(1), "</div>")(0)
This returns:
"Think of it like a standard closet. Approximately 25 square feet, this space is perfect for about a dozen boxes, a desk and chair, and a bicycle."
Rate type and price:
These are straightforward and use class name to target:
results(r, 3) = Replace$(html2.querySelector(".price_text").innerText, ":", vbNullString)
results(r, 4) = Trim$(html2.querySelector(".rate_text").innerText)
Returning (respectively)
Web Rate , £39.00
Amenities:
This is where things are a little trickier.
Let's re-examine the html shown above, for this row, that pertains to amenities:
<TD class=amenities>
<DIV title="Temperature Controlled" class="amenity_icon icon_climate"></DIV>
<DIV title="Interior Storage" class="amenity_icon icon_interior"></DIV>
<DIV title="Ground Floor" class="amenity_icon icon_ground_floor"></DIV></TD>
We can see that parent td
has a class of amenities
, which has child div
elements which have compound class names; the latter of which, in each case, serves as an identifier for amenity type e.g. icon_climate
.
When you hover over these, on the page, tooltip info is presented:
We can trace the location of this tooltip in the html of the actual page:
As you hover over different amenities this content updates.
To cut a long story short (he says whilst half way down the page!), this content is being updated from a php file on the server. We can make a request for the file and construct a dictionary which maps the class name of each amenities e.g. amenity_icon icon_climate
(which as are compound classes need " " replaced with "." when converting to the appropriate css selector of .amenity_icon.icon_climate
) to the associated descriptions. You can explore the php file here.
The php file:
Let's look at just the start of the file so as to dissect the basic unit of what is a repeating pattern:
function LoadTooltips() {
$(".units_table .amenity_icon.icon_climate").tooltip({
track: false,
delay: 0,
showURL: false,
left: -126,
top: -100,
bodyHandler: function () {
return "<div class=\"sidebar_tooltip\"><h4>Temperature Controlled</h4><p>Units are heated and/or cooled. See manager for details.</p></div>"
}
});
The function responsible for updating the tooltip is LoadTooltips
. CSS class selectors are used to target each icon:
$(".units_table .amenity_icon.icon_climate").tooltip
And we have the bodyhandler specifying the return text:
bodyHandler: function () {
return "<div class=\"sidebar_tooltip\"><h4>Temperature Controlled</h4><p>Units are heated and/or cooled. See manager for details.</p></div>"
We have three bits of useful information that appear in repeating groups. The class name selector for the element, the short description and the long description e.g.
.amenity_icon.icon_climate
: we use this to map the php file descriptions to the class name of the amenity icon in our row. CSS selectorTemperature Controlled
; inside h4
tag of tooltip function return text. Short descriptionUnits are heated and/or cooled. See manager for details.
; inside p
tag of tooltip function return text. Long descriptionI write 2 functions, GetMatches
and GetAmenitiesDescriptions
, that use regex to extract all of the repeating items, for each icon, and return a dictionary which has the css selector as the key, and the short description : long description
as the value.
When I gather all the icons in each row:
Set icons = html2.querySelectorAll(".amenity_icon")
I use the dictionary to return the tooltip descriptions based on the class name of the icon
For icon = 0 To icons.Length - 1 'use class name of amenity to look up description
amenitiesInfo(icon) = amenitiesDescriptions("." & Replace$(icons.item(icon).className, Chr$(32), "."))
Next
I then join the descriptions with vbNewLine
to ensure output is on different lines within output cell.
You can explore the regex here.
The regex uses |
(Or) syntax so I return all matched patterns in a single list.
arr = GetMatches(re, s, "(\.amenity_icon\..*)""|<h4>(.*)<\/h4>|<p>(.*)<\/p>")
As I will want different submatches (0,1 or 2 a.k.a css class selector, short desc, long desc) I use a Select Case i mod 3
, with counter variable i
, to extract appropriate sub-matches.
Example of those matches for the mapping in the php file:
Specials:
We are back to class selectors. Offer2
is not populated so you could remove.
results(r, 6) = html2.querySelector(".offer1").innerText
results(r, 7) = html2.querySelector(".offer2").innerText
returns (respectively):
Call for Specials, empty string
Closing remarks:
So, the above walks you through one row. It is simply rinse and repeat in the loop over all rows. Data is added, for efficiency, to an array, results
; which is then written to Sheet1
in one go. There are some minor improvements I can see but this is fast.
VBA:
Option Explicit
Public Sub GetInfo()
Dim ws As Worksheet, html As HTMLDocument, s As String, amenitiesDescriptions As Object
Const URL As String = "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955"
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set html = New HTMLDocument
Set amenitiesDescriptions = GetAmenitiesDescriptions
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
s = .responseText
html.body.innerHTML = s
Dim headers(), results(), listings As Object, amenities As String
headers = Array("Size", "Description", "RateType", "Price", "Amenities", "Offer1", "Offer2")
Set listings = html.querySelectorAll(".unitinfo")
Dim rowCount As Long, numColumns As Long, r As Long, c As Long
Dim icons As Object, icon As Long, amenitiesInfo(), i As Long, item As Long
rowCount = listings.Length
numColumns = UBound(headers) + 1
ReDim results(1 To rowCount, 1 To numColumns)
Dim html2 As HTMLDocument
Set html2 = New HTMLDocument
For item = 0 To listings.Length - 1
r = r + 1
html2.body.innerHTML = listings.item(item).innerHTML
results(r, 1) = Left$(html2.body.innerHTML, InStr(html2.body.innerHTML, "<") - 1)
results(r, 2) = Split(Split(html2.querySelector("SCRIPT").innerHTML, """description\"">")(1), "</div>")(0)
results(r, 3) = Replace$(html2.querySelector(".price_text").innerText, ":", vbNullString)
results(r, 4) = Trim$(html2.querySelector(".rate_text").innerText)
Set icons = html2.querySelectorAll(".amenity_icon")
ReDim amenitiesInfo(0 To icons.Length - 1)
For icon = 0 To icons.Length - 1 'use class name of amenity to look up description
amenitiesInfo(icon) = amenitiesDescriptions("." & Replace$(icons.item(icon).className, Chr$(32), "."))
Next
amenities = Join$(amenitiesInfo, vbNewLine) 'place each amenity description on a new line within cell when written out
results(r, 5) = amenities
results(r, 6) = html2.querySelector(".offer1").innerText
results(r, 7) = html2.querySelector(".offer2").innerText
Next
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub
Public Function GetAmenitiesDescriptions() As Object 'retrieve amenities descriptions from php file on server
Dim s As String, dict As Object, re As Object, i As Long, arr() 'keys based on classname, short desc, full desc
' view regex here: https://regex101.com/r/bII5AL/1
Set dict = CreateObject("Scripting.Dictionary")
Set re = CreateObject("vbscript.regexp")
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.safeandsecureselfstorage.com/core/resources/js/src/common.tooltip.php", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
s = .responseText
arr = GetMatches(re, s, "(\.amenity_icon\..*)""|<h4>(.*)<\/h4>|<p>(.*)<\/p>")
For i = LBound(arr) To UBound(arr) Step 3 'build up lookup dictionary for amenities descriptions
dict(arr(i)) = arr(i + 1) & ": " & arr(i + 2)
Next
End With
Set GetAmenitiesDescriptions = dict
End Function
Public Function GetMatches(ByVal re As Object, inputString As String, ByVal sPattern As String) As Variant
Dim matches As Object, iMatch As Object, s As String, arrMatches(), i As Long
With re
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = sPattern
If .test(inputString) Then
Set matches = .Execute(inputString)
ReDim arrMatches(0 To matches.Count - 1)
For Each iMatch In matches
Select Case i Mod 3
Case 0
arrMatches(i) = iMatch.SubMatches.item(0)
Case 1
arrMatches(i) = iMatch.SubMatches.item(1)
Case 2
arrMatches(i) = iMatch.SubMatches.item(2)
End Select
i = i + 1
Next iMatch
Else
ReDim arrMatches(0)
arrMatches(0) = vbNullString
End If
End With
GetMatches = arrMatches
End Function
Output:
References (VBE > Tools > References):
Upvotes: 3
Reputation: 1697
Here's one way to do it:
Sub test()
Dim req As New WinHttpRequest
Dim doc As New HTMLDocument
Dim targetTable As HTMLTable
Dim tableRow As HTMLTableRow
Dim tableCell As HTMLTableCell
Dim element As HTMLDivElement
Dim sht As Worksheet
Dim amenitiesString As String
Dim i As Long
Dim j As Long
Set sht = ThisWorkbook.Worksheets("Sheet1")
With req
.Open "GET", "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955", False
.send
doc.body.innerHTML = .responseText
End With
Set targetTable = doc.getElementById("units_small_units") 'You can use units_medium_units or units_large_units to get the info from the other tabs
i = 0
For Each tableRow In targetTable.Rows
i = i + 1
j = 0
For Each tableCell In tableRow.Cells
amenitiesString = ""
j = j + 1
If tableCell.className = "amenities" And tableCell.innerText <> "Amenities" Then
For Each element In tableCell.getElementsByTagName("div")
amenitiesString = amenitiesString & element.Title & ","
Next element
sht.Cells(i, j).Value = amenitiesString
ElseIf tableCell.className <> "features" Then
sht.Cells(i, j).Value = tableCell.innerText
End If
Next tableCell
Next tableRow
End Sub
I'm using an HTTP request instead of Internet Explorer to get the HTML. Apart from that I think you can get an idea of how to access the elements you want.
Here's a screenshot of the result.
The presentation is a bit primitive but you get the idea :-P
Basically this:
listing.getElementsByClassName("amenities")(0).innerText
will return a blank, because there is no inner text in these elements. The information is produced by a script but it can also be found in the title
of the div
elements.
References used:
Microsoft HTML Object Library
and WinHTTP Services Version 5.1
Upvotes: 1
Reputation:
Can you try Jquery get approach like below:
$.get( 'url', function(data) {
// Loop through elements
$(data).find("ul").find("li").each( function(){
var text = $(this).text();
} )
} );
Upvotes: 0