Reputation: 22440
I've written a script in vba in combination with selenium to parse all the company names available in a webpage. The webpage has got lazyloading method active so there are only 20 links become visible in each scroll. If I scroll 2 times then the number of links visible are 40 and so on. There are 1000 links available in that webpage. My below script can reach the bottom of that page handling all the scroll and fetch all the names available in that webpage.
However, it is necessary to wait a certain time after each scroll for that webpage to update the content. This is where I've used hardcoded delay
but the process of hardcoding thing is very inconsistent and sometimes it makes the browser quit before the completion of the whole operation.
How can I modify this portion .Wait 6000
to make it Explicit Wait
instead of Hardcoded Wait
.
This is what I've written so far:
Sub Getlinks()
Dim driver As New ChromeDriver, prevlen&, curlen&
Dim posts As Object, post As Object
With driver
.get "http://fortune.com/fortune500/list/"
prevlen = .FindElementsByClass("company-title").Count
Do
prevlen = curlen
.ExecuteScript ("window.scrollTo(0, document.body.scrollHeight);")
.Wait 6000 ''I like to kick out this hardcoded delay and use explicit wait in place
Set posts = .FindElementsByClass("company-title")
curlen = posts.Count
If prevlen = curlen Then Exit Do
Loop
For Each post In posts
R = R + 1: Cells(R, 1) = post.Text
Next post
End With
End Sub
Upvotes: 3
Views: 644
Reputation: 5677
Here is a completely different approach that doesn't require using a browser, instead it submits a series of web requests. With this approach, waiting for a page to load isn't a concern.
Typically, with lazy loading pages, it will submit a new request to load up the data for the page as you scroll. If you monitor the web traffic you can spot the requests made and emulate those, I have done that below.
The result should be a list of company names, in ascending order in whatever the first sheet of Excel is.
Things you'll need:
Add References to:
Edit
Changed the code to keep pulling data from the site, until there is no more items in the list. Thanks @Qharr for pointing this out.
Code
Public Sub SubmitRequest()
Const baseURL As String = "http://fortune.com/api/v2/list/2358051/expand/item/ranking/asc/"
Dim Url As String
Dim startingNumber As Long
Dim j As Long
Dim getRequest As MSXML2.XMLHTTP60
Dim Json As Object
Dim Companies As Object
Dim Company As Variant
Dim CompanyArray As Variant
'Create an array to hold each company
ReDim CompanyArray(0 To 50000)
'Create a new XMLHTTP object so we can place a get request
Set getRequest = New MSXML2.XMLHTTP60
'The api seems to only support returning 100 records at a time
'So do in batches of 100
Do
'Build the url, the format is something like
'0/100, where 0 is the starting position, and 100 is the ending position
Url = baseURL & startingNumber & "/" & startingNumber + 100
With getRequest
.Open "GET", Url
.send
'The response is a JSON object, for this code to work -
'You'll need this code https://github.com/VBA-tools/VBA-JSON
'What is returned is a dictionary
Set Json = JsonConverter.ParseJson(.responseText)
Set Companies = Json("list-items")
'Keep checking in batches of 100 until there are no more
If Companies.Count = 0 Then Exit Do
'Iterate the dictionary and return the title (which is the name)
For Each Company In Companies
CompanyArray(j) = Company("title")
j = j + 1
Next
End With
startingNumber = startingNumber + 100
Loop
ReDim Preserve CompanyArray(j - 1)
'Dump the data to the first sheet
ThisWorkbook.Sheets(1).Range("A1:A" & j) = WorksheetFunction.Transpose(CompanyArray)
End Sub
Upvotes: 4
Reputation: 1172
Define a timeout (specified period of time that will be allowed to elapse) to get rid of the hardcoded delay. The timeout needs to be hardcoded.
The differences between this and your original code are:
Code:
Sub Getlinks()
Dim driver As New ChromeDriver, prevlen&, curlen&
Dim posts As Object, post As Object
Dim timeout As Integer, startTime As Double
timeout = 10 ' set the timeout to 10 seconds
With driver
.get "http://fortune.com/fortune500/list/"
prevlen = .FindElementsByClass("company-title").Count
startTime = Timer ' set the initial starting time
Do
.ExecuteScript ("window.scrollTo(0, document.body.scrollHeight);")
Set posts = .FindElementsByClass("company-title")
curlen = posts.Count
If curlen > prevlen Then
startTime = Timer ' reset start time if new elements found
prevlen = curlen ' set new prevlen
End If
Loop While Round(Timer - startTime, 2) <= timeout ' check if timeout is reached
For Each post In posts
R = R + 1: Cells(R, 1) = post.Text
Next post
End With
End Sub
Upvotes: 1
Reputation: 6433
I think you are almost there.
Although I don't think you can avoid waiting, the work around is to keep a number of times checking for new posts as you scroll down with a shorter wait.
Example below is to check for new posts 5 times each with 2 seconds wait, so a total of 10 seconds before declaring end of the page. Adjust these 2 parameters to suit.
Sub Getlinks()
Dim driver As New ChromeDriver, prevlen&, curlen&
Dim posts As Object, post As Object
' Counter for number of times when there are NO NEW POSTS
Dim NoIncreaseCount As Integer
Const MaxNoIncreaseCount As Integer = 5
Const WaitTime As Integer = 2000 ' 2 seconds wait time each scroll down
With driver
.get "http://fortune.com/fortune500/list/"
prevlen = .FindElementsByClass("company-title").Count
NoIncreaseCount = 0
Do Until NoIncreaseCount = MaxNoIncreaseCount
.ExecuteScript ("window.scrollTo(0, document.body.scrollHeight);")
.Wait WaitTime
Set posts = .FindElementsByClass("company-title")
curlen = posts.Count
If prevlen < curlen Then
' There are new Posts
prevlen = curlen
NoIncreaseCount = 0
Else
' No new Posts
NoIncreaseCount = NoIncreaseCount + 1
End If
Loop
For Each post In posts
R = R + 1: Cells(R, 1) = post.Text
Next post
End With
End Sub
Upvotes: 0
Reputation: 2556
There you go:
Sub Getlinks()
Dim driver As New ChromeDriver
Dim pcount As Long, R as long
Dim posts As Object, post As Object
With driver
.get "http://fortune.com/fortune500/list/"
Do
.ExecuteScript ("window.scrollTo(0, document.body.scrollHeight);")
Set posts = .FindElementsByClass("company-title")
pcount = posts.Count
Loop Until pcount = 1000
For Each post In posts
R = R + 1: Cells(R, 1) = post.Text
Next post
End With
End Sub
Or even better, print as you go:
Sub Getlinksasyougo()
Dim driver As New ChromeDriver
Dim pcount As Long, R As Long, i As Long
Dim posts As Object, post As Object
With driver
.get "http://fortune.com/fortune500/list/"
i = 1
Do
.ExecuteScript ("window.scrollTo(0, document.body.scrollHeight);")
Set posts = .FindElementsByClass("company-title")
pcount = posts.Count
If i <> pcount Then
For R = i To pcount - 1
Cells(R, 1) = posts(R + 1).Text
Next R
i = pcount
End If
Loop Until pcount = 1000
End With
End Sub
Upvotes: 3
Reputation: 504
Here's a way to approach it using the "look for the spinner element" method discussed in one of the comments, which helps you avoid having to specify the number of elements you're expecting the page to load. The class name of the spinner actually changes depending on whether or not it's visible, which makes it pretty easy to just wait for the spinner to become visible + disappear again before getting the page elements.
This method still involves some waiting; by default, it waits 1/10th of a second after each attempt to find the spinner, either until the spinner is found or for some maximum number of attempts. But that's much faster than waiting 5 seconds every time.
Also, unrelated, but don't write stuff to cells one at a time, it's really slow. It's much faster to write it to an array first + write the entire array at once.
Sub getLinks()
Dim bot As New ChromeDriver
bot.Get "http://fortune.com/fortune500/list/"
Dim posts As WebElements
Dim numPosts As Long
Dim finishedScrolling As Boolean
finishedScrolling = False
Do Until finishedScrolling
'Set beginning post count and scroll down
Dim startPosts As Long
startPosts = numPosts
bot.ExecuteScript "window.scrollTo(0, document.body.scrollHeight);"
'Wait for spinner to become visible, then wait for up to 5 seconds for rehide
Call waitForElements(bot, "div[class^='F500-spinner ']", 50)
Call waitForElements(bot, "div[class^='F500-spinner hide']", 50)
'See if any new posts have loaded
Set posts = bot.FindElementsByClass("company-title")
numPosts = posts.Count
If numPosts = startPosts Then
finishedScrolling = True
End If
Loop
'Write text to results array
Dim post As WebElement
ReDim resultsArr(1 To posts.Count, 1 To 1) As String
Dim i As Long
i = 1
For Each post In posts
resultsArr(i, 1) = post.Text
i = i + 1
Next
'Write array to sheet
With ActiveSheet
.Range(.Cells(1, 1), .Cells(UBound(resultsArr, 1), 1)).Value = resultsArr
End With
End Sub
Sub waitForElements(bot As WebDriver, css As String, maxAttempts As Long, Optional waitTimeMS As Long = 100)
'Use a CSS selector string to wait for element(s) to appear on a page or to reach max number of attempts
'By default, bot waits 0.1 second after each attempt
Dim i As Long
Dim foundElem As Boolean
foundElem = False
Do Until foundElem
i = i + 1
If bot.FindElementsByCss(css).Count > 0 Then
foundElem = True
ElseIf i = maxAttempts Then
foundElem = True
Else
bot.Wait waitTimeMS
End If
Loop
End Sub
Upvotes: 2
Reputation: 3634
I don't know if this will help as it's still a 'hard-coded' solution but you could try a delay function rather than the wait function and see if that helps with the program exiting issue.
Function Delay(Seconds As Single)
Dim StopTime As Single: StopTime = Timer + Seconds
Do While Timer < StopTime
DoEvents
Loop
End Function
Upvotes: 1