QPTR
QPTR

Reputation: 1690

Python loop skipping one element while looping

I have this code that loops over Top Alexa 1000 websites, and get the ones that allow Sign Up or Login in any form. If there is a website in one of the iterations of this loop that gets stuck or throws Exception in any form, I remove that from my list, and start the loop over again with the next element. I am using the selenium package in Python to do this. It works fine, except the fact that for some reason its looping over every other element in my alexa_1000 list-containing variable (i.e. skipping one element), rather than going over each element. Could anyone please help? There doesn't seem to be anything wrong with the code that I can see, and I have been debugging it to see the program flow, but really can't figure out whats happening. The general flow of the program seems to be fine. When I print the index of each loop, to see the nature of skipping, that also seems fine in the sense of going from 0 to 1 to 2 to 3. Would be glad of any help. Here's the code:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException


def get_alexa_top_pages():

    sites = []

    with open('topsites_1000.txt', 'r') as f:

        for line in f:
            line = line.strip('\n')
            sites.append(line)

    sites = filter(None, sites)


    sites = ['http://www.' + site for site in sites]

    return sites

def main():

    alexa_1000 = get_alexa_top_pages()
    out = open('sites_with_login.txt', 'w')
    sign_in_strings = ['sign in', 'signin', 'login', 'log in', 'sign up', 'signup']

    driver = webdriver.Firefox()
    driver.set_page_load_timeout(30)

    for index, page in enumerate(alexa_1000):


        try:

            print "Loading page %s (num %d)" %(page, index + 1)
            driver.get(page)
            html_source = driver.page_source
            html_source = html_source.lower()
            present = any([i in html_source for i in sign_in_strings])
            if present:
                out.write(page + '\n')

            alexa_1000.remove(page)

        except TimeoutException as ex:
            alexa_1000.remove(page)
            continue
        except Exception as ex:
            alexa_1000.remove(page)
            continue    


    out.close()

if __name__ == "__main__":
    main()

Upvotes: 0

Views: 156

Answers (1)

Tarun Lalwani
Tarun Lalwani

Reputation: 146510

There are different ways to get rid of the issue. The issue is because you are touching a enumeration while enumerating it. That should always be avoided. You can do that by rewriting your code

Using sets instead of array

from selenium import webdriver
from selenium.common.exceptions import TimeoutException


def get_alexa_top_pages():

    sites = []

    with open('topsites_1000.txt', 'r') as f:

        for line in f:
            line = line.strip('\n')
            sites.append(line)

    sites = filter(None, sites)


    sites = ['http://www.' + site for site in sites]

    return sites

def main():

    alexa_1000 = set(get_alexa_top_pages())

    alexa_invalid = set()

    out = open('sites_with_login.txt', 'w')
    sign_in_strings = ['sign in', 'signin', 'login', 'log in', 'sign up', 'signup']

    driver = webdriver.Firefox()
    driver.set_page_load_timeout(30)

    for index, page in enumerate(alexa_1000):


        try:

            print "Loading page %s (num %d)" %(page, index + 1)
            driver.get(page)
            html_source = driver.page_source
            html_source = html_source.lower()
            present = any([i in html_source for i in sign_in_strings])
            if present:
                out.write(page + '\n')

        except TimeoutException as ex:
            alexa_invalid.add(page)
            continue
        except Exception as ex:
            alexa_invalid.add(page)
            continue    

    alexa_valid = alexa_1000 - alexa_invalid

    out.close()

if __name__ == "__main__":
    main()

In this you use set, one for looping and one for maintaining the list of invalid ones. If exception occurs you update the invalid one. At the end you can subtract the two to find valid sites as well

Use reverse array and pop

from selenium import webdriver
from selenium.common.exceptions import TimeoutException


def get_alexa_top_pages():

    sites = []

    with open('topsites_1000.txt', 'r') as f:

        for line in f:
            line = line.strip('\n')
            sites.append(line)

    sites = filter(None, sites)


    sites = ['http://www.' + site for site in sites]

    return sites

def main():

    alexa_1000 = get_alexa_top_pages()

    out = open('sites_with_login.txt', 'w')
    sign_in_strings = ['sign in', 'signin', 'login', 'log in', 'sign up', 'signup']

    driver = webdriver.Firefox()
    driver.set_page_load_timeout(30)

    for index, page in enumerate(alexa_1000[::-1]):


        try:

            print "Loading page %s (num %d)" %(page, index + 1)
            driver.get(page)
            html_source = driver.page_source
            html_source = html_source.lower()
            present = any([i in html_source for i in sign_in_strings])
            if present:
                out.write(page + '\n')

        except TimeoutException as ex:
            alexa_1000.pop()
            continue
        except Exception as ex:
            alexa_1000.pop()
            continue    

    out.close()

if __name__ == "__main__":
    main()

In this you loop in reverse order and the ones which error out, you just pop them out. At the end alexa_1000 will have all the valid websites which you processed

There would be lot many ways to approach this, above show just 2 of them that you can ideally use

Upvotes: 2

Related Questions