Reputation: 5523
I have written a VBA macro to count the (approximate) number of images returned for a Google search of a specific term. By approximate I mean that the program should count the number of images returned, scroll down to load some more (where applicable) up to a max of 400 images counted. Here's the (simplified) code:
Sub GoogleCount ()
'''
'[Code to construct the URL ('fullUrl')]
'''
Set objIE = New InternetExplorer
objIE.navigate fullUrl
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
Set currPage = objIE.document
'Count images returned
newNum = currPage.getElementById("rg_s").getElementsByTagName("IMG").Length
'Scroll down until count = 400 (max) or no change in value
Do While newNum >= 100 And newNum < 400 And newNum <> oldNum
oldNum = newNum
currPage.parentWindow.scrollBy 0, 100000
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
newNum = currPage.getElementById("rg_s").getElementsByTagName("IMG").Length
Loop
'''
'[Code to paste the value of newNum into my workbook, and do some other progress reporting]
'''
End Sub
I'm unhappy about scrolling, it feels very 'manual', especially when you are scrolling by a fixed value (any point making it dynamic? i.e. finding the end of the page and scrolling to there).
But the main problem is that it doesn't work: when I execute the code, it counts the first 100 (or fewer) images fine. But when it's supposed to scroll and count some more, I get the value of 100 returned. Slowly stepping through the code with F8 I get the proper numbers (max 400), which leads me to conclude that the code is running through too quickly (I may be wrong).
To slow the code down I tried adding the objIE.readyState
check loop, but because I'm only scrolling I don't think it counts as the page 're-loading' so the loop is ineffective in waiting for the new images to load.
I've thought about adding in a time delay instead. I am already employing
Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
elsewhere in the worksheet - so, I could add as small as a millisecond-order delay.
But I really want to avoid using that, as this code runs for c. 50 different searches and takes long enough to execute already, adding in fixed delays that are long enough to accommodate slow connection speeds would not be ideal. Also, internet speeds vary so much that a fixed delay is very unreliable - I could carry out some kind of connection test to get a better ball-park figure, but the best option is obviously only to wait as long as you have to.
Or better still find a different way of counting the images, preferably one which doesn't involve re-loading the page 4 times! Any ideas?
NB. If you want to debug yourself, a good image search to set fullUrl
to might be https://www.google.com/search?q=stack overflow|exchange&tbm=isch&source=lnt&tbs=isz:ex,iszw:312,iszh:390
as it returns >100 images but fewer than 400 so you can test all aspects of the code
Upvotes: 0
Views: 2412
Reputation: 5523
Through further research I've come up with this approach:
Dim myDiv As HTMLDivElement: Set myDiv = currPage.getElementById("fbar")
Dim elemRect As IHTMLRect: Set elemRect = myDiv.getBoundingClientRect
Do Until elemRect.bottom > 0
currPage.parentWindow.scrollBy 0, 10000
Set elemRect = myDiv.getBoundingClientRect
Loop
myDiv.ScrollIntoView
Where currPage
is the HTML webpage (Dim currPage As HTMLDocument
) and myDiv
is a particular element. The type is not important, but it should be noted that myDiv
is always located at the bottom of the document and is only loaded once everything else has been. So for Google images that's the help bar, which you only get to after scrolling through all the image results.
The code works as follows: myDiv.getBoundingClientRect
is a way of checking whether an element is visible in the browser - that's why we need to look at an element at the bottom of the page, as if we scroll until that becomes visible, then everything else must have loaded too.
That's of course where the Do Until...Loop
comes from; we loop until the elemRect.bottom
value is not zero (as when the element is not in view, it's zero, once it's in view it becomes a non-zero number). More info on that see here
Finally, use a myDiv.ScrollIntoView
to get the browser right to the bottom; this is necessary because the BoundingClientRect
is visible slightly before the element is on screen, so we need to scroll the last bit in order to load the final images.
Why not just use ScrollIntoView
form the start? It doesn't work, since the element hasn't loaded yet.
Upvotes: 2
Reputation: 520
Just do this instead, I am sure you can find a nicer way to do it (if you think it's worth the time) but this should be fine :
newNum = -1
Set objIE = New InternetExplorer
objIE.navigate fullUrl
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
Set currPage = objIE.document
Do Until oldNum = newNum
oldNum = newNum
newNum = currPage.getElementById("rg_s").getElementsByClassName("rg_di rg_bx rg_el ivg-i").Length
Application.Wait Now + TimeSerial(0, 0, 2)
currPage.parentWindow.scrollBy 0, 100000
Application.Wait Now + TimeSerial(0, 0, 2)
If newNum > 400 Then newNum = 400
Loop
Then you just have to adapt the delay in TimeSerial depending on how fast your computer loads ( in here I set in to 2 seconds)
Upvotes: 0