Reputation: 83
I'm wondering if anyone can give me some advice on using Selenium with Python for web scraping.
I need to get the number of elements with a certain class on a page, and I have it working well with
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get('https://www.somerandomsite.com/1')
number_of_elements = len(driver.find_elements_by_class_name('some_class'))
This gets the right number of elements every time.
But now I want to define a function so it can scrape multiple webpages, say https://www.somerandomsite.com/1 to https://www.somerandomsite.com/10.
So I do
driver = webdriver.PhantomJS()
def my_func(start, end):
    while start <= end:
        driver.get('https://www.somerandomsite.com/' + str(start))
        number_of_elements = len(driver.find_elements_by_class_name('some_class'))
        start += 1
Theoretically, this should move on to the next page and retrieve the number of elements with that class on that page. However, it works fine for the first page, but subsequent pages yield a number of elements that's either equal to the number of elements of the previous page plus that of the current page, or that sum minus 1. If I use an XPath instead of a class name selector, I get the exact same results.
Also, if I try to access any elements that are in that longer list, it throws an error since only the values on that page actually exist. So I have no idea how it's getting that longer list if the elements on it don't even exist. (For example, if there are 8 elements on page one and 5 elements on page two, when it gets to page two it'll say there are 12 or 13 elements. If I access elements 1-5 they all return values, but trying to call the sixth element or higher will cause a NoSuchElementException.)
Anyone know why this might be happening?
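One way to check the symptom described above is to compare how many elements the driver reports with how many of them can actually be accessed. This is only a diagnostic sketch, not an explanation of the cause: `count_accessible` is a hypothetical helper name, and it assumes that reading `tag_name` on a dead reference raises a Selenium error, as described for the sixth and later elements.

```python
try:
    # Base class of Selenium's errors, including NoSuchElementException.
    from selenium.common.exceptions import WebDriverException
except ImportError:  # fallback so the sketch can be read without Selenium installed
    WebDriverException = Exception


def count_accessible(driver, class_name):
    """Return (reported, accessible) counts for elements with class_name."""
    # "class name" is the string value of Selenium's By.CLASS_NAME.
    elements = driver.find_elements("class name", class_name)
    accessible = 0
    for element in elements:
        try:
            element.tag_name  # asks the browser about this element
            accessible += 1
        except WebDriverException:
            pass  # the reference does not point at a live element
    return len(elements), accessible
```

If the two numbers differ (for example 13 reported but 5 accessible), the extra entries are references that no longer exist on the current page.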
EDIT: I've narrowed it down a bit more, hopefully this helps. Sorry I was off in the initial question.
driver = webdriver.PhantomJS()
def my_func(start, end):
    while start <= end:
        driver.get('https://www.somerandomsite.com/' + str(start))
        number_of_elements = len(driver.find_elements_by_class_name('some_class'))
        start += 1
So the above code actually works. However, when I then navigate to another page that also has elements of 'some_class', and then continue looping, it adds the number of elements from the previous page to the current page.
So my code's like this:
driver = webdriver.PhantomJS()
def my_func(start, end):
    while start <= end:
        driver.get('https://www.somerandomsite.com/' + str(start))
        number_of_elements = len(driver.find_elements_by_class_name('some_class'))
        print(number_of_elements)
        # Visiting this extra page is what triggers the inflated count on the next iteration
        driver.get('https://www.somerandomsite.com/otherpage')
        start += 1
my_func(1, 2)
So let's say https://www.somerandomsite.com/1 has 8 elements of class 'some_class', https://www.somerandomsite.com/otherpage has 7 elements of class 'some_class', and https://www.somerandomsite.com/2 has 10 elements of class 'some_class'.
When I run the above code, it'll print 8, then 17. If I don't navigate to the other page, and run
driver = webdriver.PhantomJS()
def my_func(start, end):
    while start <= end:
        driver.get('https://www.somerandomsite.com/' + str(start))
        number_of_elements = len(driver.find_elements_by_class_name('some_class'))
        print(number_of_elements)
        start += 1
my_func(1, 2)
it'll print 8 then 10, as I want it to. I'm not sure why it's counting elements on two pages at once, and only if I get that other page beforehand.
EDIT2: So I've gotten it working by navigating to a page on a different server and then returning to the page I want. Weird, but I'll use it. If anyone has any ideas on why it doesn't work if I don't though I'd still love to understand the problem better.
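For anyone wanting to try the same workaround, here is a minimal sketch of it. The helper name `count_with_reset` and the `reset_url` parameter are made up for illustration; `reset_url` stands for whatever page on a different server you choose to visit in between. It uses the generic `find_elements(by, value)` call, which should behave the same as `find_elements_by_class_name`.

```python
def count_with_reset(driver, url, class_name, reset_url):
    """Visit reset_url first, then url, and count elements with class_name."""
    driver.get(reset_url)  # throwaway navigation to a page on another server
    driver.get(url)        # the page whose elements we actually want
    # "class name" is the string value of Selenium's By.CLASS_NAME.
    return len(driver.find_elements("class name", class_name))
```

Inside the loop this would replace the plain `driver.get(...)` plus `len(...)` pair, e.g. `count_with_reset(driver, 'https://www.somerandomsite.com/' + str(start), 'some_class', reset_url)`. It hides the symptom rather than explaining it.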
Upvotes: 0
Views: 189
Reputation: 740
It's difficult to tell what the problem is, if there is one at all, since you don't provide the details needed to reproduce what you're describing.
IMHO a function that wraps the whole loop is overkill for this simple task. For this kind of thing I'd generally put the loop outside the function and have the function handle a single page.
Also, you need a function call for this to do anything at all, and a return statement to get the count back.
Like so:
def my_func(driver, count):
    driver.get('https://www.somerandomsite.com/%d' % count)
    number_of_elements = len(driver.find_elements_by_class_name('some_class'))
    return number_of_elements

driver = webdriver.PhantomJS()
total_element_count = 0
count = 1
while count < 1000:  # or whatever number you need
    number_of_elements = my_func(driver, count)
    total_element_count += number_of_elements
    print("[*] Elements for iteration %d: %d" % (count, number_of_elements))
    print("[*] Total count so far: %d" % total_element_count)
    count += 1
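If you want to keep the per-page numbers around instead of only printing them, a variant along these lines might help. It is a sketch under assumptions: `count_per_page` and its `base_url` parameter are names I made up, and it uses the generic `find_elements(by, value)` call rather than `find_elements_by_class_name`.

```python
def count_per_page(driver, base_url, start, end, class_name):
    """Return {page_number: element_count} for pages start..end inclusive."""
    counts = {}
    for page in range(start, end + 1):
        driver.get(base_url + str(page))
        # "class name" is the string value of Selenium's By.CLASS_NAME.
        counts[page] = len(driver.find_elements("class name", class_name))
    return counts
```

Called as `counts = count_per_page(driver, 'https://www.somerandomsite.com/', 1, 10, 'some_class')`, this gives one entry per page, and `sum(counts.values())` gives the running total if you need it.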
Upvotes: 1
Reputation: 4267
Take a look at
number_of_elements = len(driver.find_elements_by_class_name('some_class'))
You assign the length of the element list on each iteration, but you need to sum the lengths instead, so your code should look like:
driver = webdriver.PhantomJS()
def my_func(start, end):
    count = 0
    while start <= end:
        driver.get('https://www.somerandomsite.com/' + str(start))
        count += len(driver.find_elements_by_class_name('some_class'))
        start += 1
    return count  # total across all pages visited
Upvotes: 0