Reputation: 750
CODE :
def ValidateProxy(LIST_PROXIES):
    '''
    Checks if scraped proxies allow HTTPS connection
    '''
    for proxy in LIST_PROXIES:
        print('using', proxy)
        host, port = str(proxy).split(":")
        try:
            resp = requests.get('https://amazon.com',
                                proxies=dict(https=f'socks5://{host}:{port}'),
                                timeout=6)
        except ConnectionError:
            print(proxy, 'REMOVED')
            LIST_PROXIES.remove(proxy)
    print(len(LIST_PROXIES), 'PROXIES GATHERED')
    if len(LIST_PROXIES) != 0:
        return LIST_PROXIES
    else:
        return None
INPUT :
['46.4.96.137:1080', '138.197.157.32:1080', '138.68.240.218:1080'.....] #15 proxies
OUTPUT :
using 46.4.96.137:1080
46.4.96.137:1080 REMOVED
using 138.68.240.218:1080
138.68.240.218:1080 REMOVED
using 207.154.231.213:1080
207.154.231.213:1080 REMOVED
using 198.199.120.102:1080
198.199.120.102:1080 REMOVED
using 88.198.24.108:1080
88.198.24.108:1080 REMOVED
using 188.226.141.211:1080
188.226.141.211:1080 REMOVED
using 92.222.180.156:1080
92.222.180.156:1080 REMOVED
using 183.233.183.70:1081
183.233.183.70:1081 REMOVED
7 PROXIES GATHERED # len(LIST_PROXIES) == 7, so 8 are removed which are printed above
MY DOUBTS :
Why is print('using', proxy) not executed on every iteration? The input list has 15 items, but this line is printed only 8 times.
Are the try and except blocks both executed every time? REMOVED is printed on the console for every proxy that is tried.
I want the function to print('using', proxy) for every proxy and, if a ConnectionError occurs, to print(proxy, 'REMOVED') and remove that proxy from the list.
EDIT : FULL INPUT
['46.4.96.137:1080', '138.197.157.32:1080', '138.68.240.218:1080', '162.243.108.129:1080', '207.154.231.213:1080', '176.9.119.170:1080', '198.199.120.102:1080', '176.9.75.42:1080', '88.198.24.108:1080', '188.226.141.61:1080', '188.226.141.211:1080', '125.124.185.167:38801', '92.222.180.156:1080', '188.166.83.17:1080', '183.233.183.70:1081']
Upvotes: 0
Views: 194
Reputation: 692
I would separate the logic into two functions. Also, please follow PEP 8 (I did not point that out in the original answer).
from typing import Iterable

import requests


def is_valid_proxy(proxy: str) -> bool:
    try:
        requests.get(
            'https://amazon.com',
            proxies={'https': f'socks5://{proxy}'},
            timeout=6,
        )
        return True
    # requests raises its own ConnectionError, which is not a subclass
    # of the built-in ConnectionError, so name it explicitly.
    except requests.exceptions.ConnectionError:
        return False


def get_valid_proxies(proxies: Iterable[str]) -> list[str]:
    return [proxy for proxy in proxies if is_valid_proxy(proxy)]
Instead of printing to stdout, you could use the logging module.
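As a rough sketch of that suggestion, the filtering step could report through a logger rather than print. The `check` parameter is a hypothetical stand-in for is_valid_proxy so the example runs without network access. The function name, logger name and messages are only illustrative.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("proxy_validator")  # hypothetical logger name


def filter_proxies(proxies: Iterable[str],
                   check: Callable[[str], bool]) -> list[str]:
    """Return the proxies accepted by `check`, logging each decision."""
    valid = []
    for proxy in proxies:
        logger.info("using %s", proxy)
        if check(proxy):
            valid.append(proxy)
        else:
            logger.warning("%s REMOVED", proxy)
    logger.info("%d PROXIES GATHERED", len(valid))
    return valid


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Stand-in check: pretend only proxies on port 1080 respond.
    print(filter_proxies(["46.4.96.137:1080", "183.233.183.70:1081"],
                         lambda p: p.endswith(":1080")))
```

With logging, the caller decides the verbosity (for example level=logging.WARNING to see only removals) without touching the function body.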
The problem is that you are iterating over LIST_PROXIES and removing elements from it at the same time.
If you only want to iterate over LIST_PROXIES once, something like this could work:
def ValidateProxy(LIST_PROXIES):
    index = 0
    for i in range(len(LIST_PROXIES)):
        proxy = LIST_PROXIES[index]
        print('using', proxy)
        host, port = str(proxy).split(":")
        try:
            resp = requests.get('https://amazon.com',
                                proxies=dict(https=f'socks5://{host}:{port}'),
                                timeout=6)
            index += 1
        except ConnectionError:
            print(proxy, 'REMOVED')
            LIST_PROXIES.pop(index)  # Index is not incremented
    print(len(LIST_PROXIES), 'PROXIES GATHERED')
    if len(LIST_PROXIES) != 0:
        return LIST_PROXIES
    else:
        return None
However, if iterating over the list twice is not a problem, you can just make a copy of the list, as Sy Ker pointed out.
Upvotes: 2
Reputation: 2180
You are removing items from a list you are iterating over. NOT GOOD. You should iterate over a copy of the list, leaving you free to modify the original. Simply replace for proxy in LIST_PROXIES: with for proxy in list(LIST_PROXIES):
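A small sketch of that fix, assuming a hypothetical `is_alive` callback in place of the real requests call so that it can run offline:

```python
from typing import Callable


def validate_proxies(proxies: list[str],
                     is_alive: Callable[[str], bool]) -> list[str]:
    """Remove dead proxies from `proxies` in place and return it."""
    # list(proxies) is a shallow copy, so removals from the original
    # list cannot shift the items the loop has yet to visit.
    for proxy in list(proxies):
        if not is_alive(proxy):
            proxies.remove(proxy)
    return proxies


proxies = ["46.4.96.137:1080", "138.197.157.32:1080", "138.68.240.218:1080"]
# Stand-in check: treat only the second proxy as reachable.
print(validate_proxies(proxies, lambda p: p.startswith("138.197")))
# ['138.197.157.32:1080']
```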
Upvotes: 1
Reputation: 2136
The issue is caused by the fact that you are mutating the list while you are still looping over it, in this line:
LIST_PROXIES.remove(proxy)
This means that just before the for loop looks for the 'next' item in the list, the 'next' item moves left in the list and is therefore missed completely.
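The skipping can be reproduced without any network access. In this minimal sketch (the function name and the integer stand-ins for proxies are only for illustration), every item fails the check, which is what the question's output suggests happened to every proxy that was tried:

```python
def remove_while_iterating(items: list) -> list:
    """Mimic the question's loop when every item fails; return the visited items."""
    visited = []
    for item in items:          # the loop tracks a position, not an item
        visited.append(item)
        items.remove(item)      # later items shift one place to the left
    return visited


proxies = list(range(15))       # 15 stand-in "proxies"
visited = remove_while_iterating(proxies)
print(visited)   # [0, 2, 4, 6, 8, 10, 12, 14]
print(proxies)   # [1, 3, 5, 7, 9, 11, 13]
```

Eight of the fifteen items are visited and seven survive, which matches the 8 'using' lines and the final count of 7 in the question.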
Check out this question/answer: strange result when removing item from a list
Upvotes: 1