Reputation: 83

Selenium Python Selector Returning Too Many Values

I'm wondering if anyone can give me some advice on using Selenium with Python for web scraping.

I need to get the number of elements with a certain class on a page, and I have it working well with:

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('https://www.somerandomsite.com/1')
number_of_elements = len(driver.find_elements_by_class_name('some_class'))

This gets the right number of elements every time.

But now I want to define a function so it can scrape multiple webpages, say https://www.somerandomsite.com/1 to https://www.somerandomsite.com/10.

So I do:

driver = webdriver.PhantomJS()
def my_func(start, end):
    while start <= end:
        driver.get('https://www.somerandomsite.com/'+str(start))
        number_of_elements = len(driver.find_elements_by_class_name('some_class'))
        start += 1

Theoretically, this should move on to the next page and retrieve the number of elements with the class I want on that page. It works fine for the first page, but subsequent pages yield a number of elements that is either equal to the count from the previous page plus that of the current page, or that sum minus 1. If I use an XPath instead of a class name selector, I get exactly the same results.

Also, if I try to access any elements that are in that longer list, it throws an error since only the values on that page actually exist. So I have no idea how it's getting that longer list if the elements on it don't even exist. (For example, if there are 8 elements on page one and 5 elements on page two, when it gets to page two it'll say there are 12 or 13 elements. If I access elements 1-5 they all return values, but trying to call the sixth element or higher will cause a NoSuchElementException.)
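
One way to check the mismatch described above is to compare the length of the returned list with the number of elements that can actually be read. The sketch below is only a diagnostic, not a fix. `count_readable` is a hypothetical helper; it assumes the elements expose `.text` the way Selenium's WebElement does, and it catches a broad `Exception` so that it does not depend on a Selenium import. In real code you would probably narrow that to `selenium.common.exceptions.WebDriverException`.

```python
def count_readable(elements):
    """Return (reported, readable) for a list of WebElement-like objects.

    reported is simply len(elements); readable counts the elements whose
    .text attribute can be read without raising.
    """
    readable = 0
    for element in elements:
        try:
            element.text  # reading from a dead reference raises in Selenium
        except Exception:  # assumption: narrow this to WebDriverException
            continue
        readable += 1
    return len(elements), readable
```

Calling `count_readable(driver.find_elements_by_class_name('some_class'))` right after `driver.get(...)` should show whether the extra entries in the list are references that no longer resolve on the current page.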

Anyone know why this might be happening?

EDIT: I've narrowed it down a bit more, hopefully this helps. Sorry I was off in the initial question.

driver = webdriver.PhantomJS()
def my_func(start, end):
    while start <= end:
        driver.get('https://www.somerandomsite.com/'+str(start))
        number_of_elements = len(driver.find_elements_by_class_name('some_class'))
        start += 1

So the above code actually works. However, when I then navigate to another page that also has elements of 'some_class', and then continue looping, it adds the number of elements from the previous page to the current page.

So my code's like this:

driver = webdriver.PhantomJS()
def my_func(start, end):
    while start <= end:
        driver.get('https://www.somerandomsite.com/'+str(start))
        number_of_elements = len(driver.find_elements_by_class_name('some_class'))
        print(number_of_elements)
        driver.get('https://www.somerandomsite.com/otherpage')
        start += 1

my_func(1,2)

So let's say https://www.somerandomsite.com/1 has 8 elements of class 'some_class', https://www.somerandomsite.com/otherpage has 7 elements of class 'some_class', and https://www.somerandomsite.com/2 has 10 elements of class 'some_class'.

When I run the above code, it'll print 8, then 17. If I don't navigate to the other page, and run

driver = webdriver.PhantomJS()
def my_func(start, end):
    while start <= end:
        driver.get('https://www.somerandomsite.com/'+str(start))
        number_of_elements = len(driver.find_elements_by_class_name('some_class'))
        print(number_of_elements)
        start += 1

my_func(1,2)

it'll print 8 then 10, as I want it to. I'm not sure why it's counting elements on two pages at once, and only if I get that other page beforehand.

EDIT2: So I've gotten it working by navigating to a page on a different server and then returning to the page I want. Weird, but I'll use it. If anyone has any ideas on why it doesn't work if I don't though I'd still love to understand the problem better.
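
The workaround in EDIT2 suggests that some state carries over between pages on the same site. A heavier alternative, sketched here under that assumption, is to start a fresh driver for every page so that nothing can carry over. `count_with_fresh_driver` and its `make_driver` parameter are hypothetical names; `make_driver` is assumed to be any zero-argument callable that returns a driver, such as `webdriver.PhantomJS`. The lookup keeps the method name used in the question; on recent Selenium releases it is spelled `driver.find_elements(By.CLASS_NAME, class_name)`.

```python
def count_with_fresh_driver(make_driver, base_url, start, end, class_name):
    """Count elements of class_name on pages start..end, one driver per page."""
    counts = {}
    for page in range(start, end + 1):
        driver = make_driver()  # hypothetical factory, e.g. webdriver.PhantomJS
        try:
            driver.get(base_url + str(page))
            counts[page] = len(driver.find_elements_by_class_name(class_name))
        finally:
            driver.quit()  # always shut the browser down, even on errors
    return counts
```

A call such as `count_with_fresh_driver(webdriver.PhantomJS, 'https://www.somerandomsite.com/', 1, 2, 'some_class')` returns a dict mapping page number to count. It is slower than reusing one driver, since a browser is started for every page, so it is probably best treated as a way to confirm whether shared driver state is the cause.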

Upvotes: 0

Answers (2)

jlaur

Reputation: 740

It's difficult to tell what the problem is, if there is one at all, since you don't provide the details needed to reproduce what you're describing.

IMHO a function is overkill for this simple task; a plain loop would do.

Also, your function needs to be called for it to do anything at all, and it needs a return statement.

In general, for similar tasks I'd put the loop outside the function.

Like so:

from selenium import webdriver

def my_func(driver, count):
    driver.get('https://www.somerandomsite.com/%d' % count)
    number_of_elements = len(driver.find_elements_by_class_name('some_class'))
    return number_of_elements

driver = webdriver.PhantomJS()
total_element_count = 0
count = 1
while count < 1000:  # or whatever number you need
    number_of_elements = my_func(driver, count)
    total_element_count += number_of_elements
    print("[*] Elements for iteration %d: %d" % (count, number_of_elements))
    print("[*] Total count so far: %d" % total_element_count)
    count += 1
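
If the per-page numbers are what you want to inspect afterwards, a variation on the loop above is to collect them instead of only printing them. This is a sketch, not part of the code above: `collect_counts` is a hypothetical name, and the driver is passed in so the function makes no assumption about which browser is used.

```python
def collect_counts(driver, start, end, class_name='some_class',
                   base_url='https://www.somerandomsite.com/'):
    """Return (per_page, total) for pages start..end inclusive."""
    per_page = {}
    for page in range(start, end + 1):
        driver.get(base_url + str(page))
        # Store each page's count under its page number so the values
        # can be compared side by side after the run.
        per_page[page] = len(driver.find_elements_by_class_name(class_name))
    return per_page, sum(per_page.values())
```

Keeping the counts keyed by page number makes it easy to spot a page whose count looks like the sum of two pages, which is the symptom described in the question.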

Upvotes: 1

pythad

Reputation: 4267

Take a look at

number_of_elements = len(driver.find_elements_by_class_name('some_class'))

You assign the length of the element list on each iteration, but you need to sum them instead, so your code should look like this:

driver = webdriver.PhantomJS()
def my_func(start, end):
    count = 0
    while start <= end:
        driver.get('https://www.somerandomsite.com/' + str(start))
        count += len(driver.find_elements_by_class_name('some_class'))
        start += 1
    return count

Upvotes: 0
