Reputation: 1690
I have this code that loops over the Alexa Top 1000 websites and collects the ones that allow Sign Up or Login in any form. If a website gets stuck or throws an Exception during an iteration, I remove it from my list and continue the loop with the next element. I am using the selenium package in Python to do this. It works fine, except that for some reason it loops over every other element in my alexa_1000 list (i.e. it skips one element each time) rather than going over each element. Could anyone please help? I can't see anything wrong with the code, and I have been debugging it to follow the program flow, but I really can't figure out what's happening. The general flow of the program seems fine. When I print the index on each iteration to see the nature of the skipping, that also looks fine: it goes from 0 to 1 to 2 to 3. I would be glad of any help. Here's the code:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

def get_alexa_top_pages():
    sites = []
    with open('topsites_1000.txt', 'r') as f:
        for line in f:
            line = line.strip('\n')
            sites.append(line)
    sites = filter(None, sites)
    sites = ['http://www.' + site for site in sites]
    return sites

def main():
    alexa_1000 = get_alexa_top_pages()
    out = open('sites_with_login.txt', 'w')
    sign_in_strings = ['sign in', 'signin', 'login', 'log in', 'sign up', 'signup']
    driver = webdriver.Firefox()
    driver.set_page_load_timeout(30)
    for index, page in enumerate(alexa_1000):
        try:
            print "Loading page %s (num %d)" %(page, index + 1)
            driver.get(page)
            html_source = driver.page_source
            html_source = html_source.lower()
            present = any([i in html_source for i in sign_in_strings])
            if present:
                out.write(page + '\n')
            alexa_1000.remove(page)
        except TimeoutException as ex:
            alexa_1000.remove(page)
            continue
        except Exception as ex:
            alexa_1000.remove(page)
            continue
    out.close()

if __name__ == "__main__":
    main()
Upvotes: 0
Views: 156
Reputation: 146510
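There are different ways to get rid of the issue. The problem is that you are modifying the list while iterating over it, which should always be avoided: each remove() shifts the remaining items one position to the left, so the loop's next step skips an element. You can avoid that by rewriting your code: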
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

def get_alexa_top_pages():
    sites = []
    with open('topsites_1000.txt', 'r') as f:
        for line in f:
            line = line.strip('\n')
            sites.append(line)
    sites = filter(None, sites)
    sites = ['http://www.' + site for site in sites]
    return sites

def main():
    alexa_1000 = set(get_alexa_top_pages())
    alexa_invalid = set()
    out = open('sites_with_login.txt', 'w')
    sign_in_strings = ['sign in', 'signin', 'login', 'log in', 'sign up', 'signup']
    driver = webdriver.Firefox()
    driver.set_page_load_timeout(30)
    for index, page in enumerate(alexa_1000):
        try:
            print "Loading page %s (num %d)" %(page, index + 1)
            driver.get(page)
            html_source = driver.page_source
            html_source = html_source.lower()
            present = any([i in html_source for i in sign_in_strings])
            if present:
                out.write(page + '\n')
        except TimeoutException as ex:
            alexa_invalid.add(page)
            continue
        except Exception as ex:
            alexa_invalid.add(page)
            continue
    alexa_valid = alexa_1000 - alexa_invalid
    out.close()

if __name__ == "__main__":
    main()
This version uses two sets: one to loop over and one to collect the invalid sites. When an exception occurs, the page is added to the invalid set. At the end, subtracting the invalid set from the full set gives you the valid sites as well.
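To see the effect in isolation, here is a minimal sketch without Selenium. The page names and the is_broken helper are hypothetical stand-ins, not part of the original code; the first loop reproduces the skipping, and the second one shows the "collect failures separately" pattern on the same data.

```python
# Hypothetical stand-in data; no browser is needed to see the effect.
pages = ['a', 'b', 'c', 'd', 'e', 'f']

# Buggy pattern: removing from the very list the for loop is walking.
visited = []
buggy = list(pages)
for page in buggy:
    visited.append(page)
    buggy.remove(page)  # later items shift left, so the next one is skipped


def is_broken(page):
    # Hypothetical check standing in for a page that times out or raises.
    return page in ('b', 'e')


# Safer pattern: leave the iterated list alone and record failures elsewhere.
invalid = set()
for page in pages:
    if is_broken(page):
        invalid.add(page)
valid = set(pages) - invalid

print(visited)        # only every other page was visited
print(sorted(valid))  # pages that did not fail
```

Only the first loop mutates what it iterates over, which is why it visits every other item; the second loop visits all six.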
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

def get_alexa_top_pages():
    sites = []
    with open('topsites_1000.txt', 'r') as f:
        for line in f:
            line = line.strip('\n')
            sites.append(line)
    sites = filter(None, sites)
    sites = ['http://www.' + site for site in sites]
    return sites

def main():
    alexa_1000 = get_alexa_top_pages()
    out = open('sites_with_login.txt', 'w')
    sign_in_strings = ['sign in', 'signin', 'login', 'log in', 'sign up', 'signup']
    driver = webdriver.Firefox()
    driver.set_page_load_timeout(30)
    # alexa_1000[::-1] is a reversed copy, so the original list can be
    # modified safely inside the loop.
    for index, page in enumerate(alexa_1000[::-1]):
        try:
            print "Loading page %s (num %d)" %(page, index + 1)
            driver.get(page)
            html_source = driver.page_source
            html_source = html_source.lower()
            present = any([i in html_source for i in sign_in_strings])
            if present:
                out.write(page + '\n')
        except TimeoutException as ex:
            # Remove this specific page; a bare pop() would drop whatever is
            # last in the list, which may be a page that loaded fine.
            alexa_1000.remove(page)
            continue
        except Exception as ex:
            alexa_1000.remove(page)
            continue
    out.close()

if __name__ == "__main__":
    main()

Here you loop over a reversed copy of the list and remove the pages that error out from the original list. Because the loop iterates over the copy, removing from alexa_1000 does not disturb the iteration. At the end, alexa_1000 will contain only the valid websites that you processed.
There are many other ways to approach this; the two above are just examples that you can use.
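As one more option, here is a rough sketch that never mutates the input list at all and instead sorts pages into two new lists, which also keeps the original ranking order. The split_pages and fake_check names and the example URLs are assumptions made for illustration; in the real script the check would be the driver.get(page) call and the string search.

```python
def split_pages(pages, check):
    """Return (ok, failed) lists without modifying `pages`."""
    ok, failed = [], []
    for page in pages:
        try:
            check(page)
        except Exception:
            failed.append(page)
        else:
            ok.append(page)
    return ok, failed


def fake_check(page):
    # Hypothetical stand-in for loading the page; raises for one URL.
    if page == 'http://www.bad.example':
        raise RuntimeError('page failed to load')


pages = [
    'http://www.a.example',
    'http://www.bad.example',
    'http://www.b.example',
]
ok, failed = split_pages(pages, fake_check)
```

Since nothing is removed during the loop, every page is visited exactly once and the input list is still intact afterwards.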
Upvotes: 2