Reputation: 300
I am currently writing a Python program that checks whether proxies respond and measures how long they take to do so. The URL I'm GETing is a public API that can handle millions of requests per second (ipify.org), so that shouldn't be the bottleneck. Testing hundreds or even thousands of proxies sequentially is of course slow with a timeout of 15 s (e.g. 100 * 15 s = 25 min), hence I introduced threading into my program. The following behaviour occurs:
When I start 256 threads that process a list of 5000 proxies, the ~10 % of them that do respond show steadily increasing response times...
When I start only 16 threads, response times vary, i.e. proxies further down the list sometimes respond faster than the ones tested earlier (this is as it should be).
I am more or less a networking beginner, so my question is: what is the limit of threads/requests per second I can run without distorting the measurements?
import time
from collections import deque
from threading import Thread

q = deque(proxies)  # proxies: the list of "ip:port" strings to test
proxy_list = []     # results; list.append is atomic in CPython

def consumer(id):
    while True:
        try:
            proxy = q.popleft()
        except IndexError:
            # A separate len(q) check would race between threads; popping
            # and catching the empty case is the safe way to drain the queue.
            break
        # Give each thread a different, small offset to avoid simultaneous
        # tcp/ip bombing... (maybe ??)
        time.sleep(id * 0.01)
        s_t = time.time()
        state = check_proxy(proxy)
        response_time = time.time() - s_t
        proxy_list.append({
            'proxy_ip': proxy,
            'working': state[0],
            'resp_time': response_time if state[0] else None
        })

threads = []
# 256 threads
for i in range(256):
    t = Thread(target=consumer, args=(i,))
    t.daemon = True
    t.start()
    threads.append(t)
for thr in threads:
    thr.join()
import requests

def check_proxy(proxy, conn_type='http', site='http://api.ipify.org', timeout=15):
    # Format to e.g. { "http": "http://183.207.232.119:8080" }
    proxy_req = {conn_type: "%s://%s" % (conn_type, proxy.rstrip())}
    try:
        r = requests.get(site, proxies=proxy_req, timeout=timeout)
        return True, r
    except requests.exceptions.RequestException as e:
        return False, e
                 proxy_ip  working  resp_time
26 212.66.42.98:8080 True 1.417061
60 50.97.212.199:3128 True 2.986519
62 23.88.238.46:8081 True 2.002400
63 183.207.229.202:80 True 2.452403
64 183.207.229.194:80 True 2.283683
65 183.207.229.195:80 True 2.501426
66 60.194.100.51:80 True 2.108991
67 83.222.221.137:8080 True 3.075372
68 37.239.46.26:80 True 2.776244
69 80.94.114.197:80 True 1.707185
71 41.75.201.146:8080 True 3.287514
72 42.202.146.58:8080 True 3.874238
75 222.45.196.19:8118 True 3.375033
76 120.202.249.230:80 True 2.778418
77 222.124.198.136:3129 True 2.638542
78 61.184.192.42:80 True 3.474871
79 101.251.238.123:8080 True 2.216384
80 222.87.129.218:80 True 2.541614
81 113.6.252.139:808 True 4.340471
82 218.240.156.82:80 True 3.737869
83 221.176.14.72:80 True 2.408369
84 58.253.238.242:80 True 4.351352
86 219.239.236.49:8888 True 4.693788
87 222.88.236.236:83 True 5.213140
88 119.6.144.73:82 True 3.002683
.. ... ... ...
256 36.85.88.179:8080 True 10.218517
257 117.21.192.9:80 True 10.322229
258 120.193.146.95:83 True 6.408998
259 91.241.18.129:3129 True 7.596714
260 58.213.19.134:2311 True 6.430531
261 27.131.190.66:8080 True 8.047689
262 222.88.236.236:82 True 8.649196
263 119.6.144.73:83 True 8.205048
265 176.31.138.187:80 True 11.444282
266 195.88.192.144:8080 True 6.716996
267 91.188.39.232:8888 True 7.986101
268 202.95.149.62:8080 True 12.453279
269 121.31.5.188:8080 True 6.956209
271 5.53.16.183:8080 True 10.354440
272 37.187.101.28:3128 True 10.922564
273 60.207.63.124:8118 True 9.908007
274 223.195.87.101:8081 True 13.230916
275 89.251.103.130:8080 True 13.350009
276 121.14.138.56:81 True 12.367794
277 118.244.213.6:3128 True 9.533521
278 218.92.227.170:13669 True 12.410708
280 212.68.51.58:80 True 10.599926
446 190.121.148.229:8080 True 15.064356
450 220.132.214.103:9064 True 17.016748
451 164.138.237.251:8080 True 16.171984
454 222.124.28.188:8080 True 15.233777
455 62.176.13.22:8088 True 17.180487
456 82.146.44.39:443 True 15.448998
755 85.9.209.244:8080 True 26.002548
757 201.86.94.166:8080 True 25.771388

[758 rows x 3 columns]
The proxies that got checked later clearly have a much longer response time. I tried shuffling the queue at the start to verify that the proxies further down my list weren't simply slower; that is indeed not the case, and the result shown here is reproducible.
Upvotes: 3
Views: 2319
Reputation: 24324
If you only have a single process, you only get one slice of the CPU, and that slice is divided between your 256 threads. That's potentially a lot of context switching.
A few things to consider:

- Use more processes (see the multiprocessing module).
- Your check_proxy implementation is going to be the bottleneck (is it based on a socket select function or some blocking implementation?).

With that many threads, and assuming you are using a regular desktop machine (are most of them 8-core nowadays?), that is a lot of context switching. Using the requests library may hide a lot of the boilerplate code you would otherwise need, but you may not be using connection pooling properly.
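For illustration, a minimal sketch of what explicit pooling could look like (the shared Session and the pool sizes are my assumptions, not code from the question): requests only reuses connections when calls go through a shared Session, whereas a bare requests.get() builds and tears down its pool on every call. Note that with thousands of distinct proxies the reuse is limited anyway, since each proxy is a separate upstream connection.

import requests
from requests.adapters import HTTPAdapter

# One shared Session reuses TCP connections via urllib3's pool.
session = requests.Session()
# Pool sizes here are illustrative, not tuned values.
adapter = HTTPAdapter(pool_connections=100, pool_maxsize=100)
session.mount('http://', adapter)
session.mount('https://', adapter)

def check_proxy_pooled(proxy, timeout=15):
    # Hypothetical variant of the question's check_proxy using the shared session.
    proxy_req = {'http': 'http://%s' % proxy.rstrip()}
    try:
        r = session.get('http://api.ipify.org', proxies=proxy_req, timeout=timeout)
        return True, r
    except requests.exceptions.RequestException as e:
        return False, e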
With one process you only get so far: if there are N processes running, you get 1/N of the CPU time, but if 2 of those N processes are yours, you get 2/N.
You should be better off using the multiprocessing module, which will use more cores. While this won't make the responses themselves arrive any faster, it will expedite the handling of the responses.
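A minimal sketch of that idea, reusing the question's check_proxy (the pool size of 16 and the proxies list are assumptions for illustration):

from multiprocessing import Pool
import time

def check_one(proxy):
    # check_proxy is the function from the question; it must be defined at
    # module top level so the worker processes can import it.
    s_t = time.time()
    state = check_proxy(proxy)
    resp_time = time.time() - s_t
    return {'proxy_ip': proxy,
            'working': state[0],
            'resp_time': resp_time if state[0] else None}

if __name__ == '__main__':
    pool = Pool(processes=16)                   # arbitrary; tune to your core count
    proxy_list = pool.map(check_one, proxies)   # proxies: list of "ip:port" strings
    pool.close()
    pool.join()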
Use select.select() for more efficient I/O handling; this works for sockets too, since they expose socket.fileno().
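For what that could look like, here is a single-threaded sketch (my illustration, not code from the answer) that measures pure TCP connect times with non-blocking sockets. It does not speak HTTP, and select() is typically capped at around 1024 descriptors, so a 5000-entry list would need batching:

import select
import socket
import time

def tcp_connect_times(proxies, timeout=15):
    """Measure TCP connect time to many proxies at once, in one thread.
    Only tests reachability of the proxy port, not a full HTTP request."""
    pending = {}  # fileno -> (socket, "ip:port", start_time)
    for p in proxies:
        host, port = p.rsplit(':', 1)
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setblocking(False)
        s.connect_ex((host, int(port)))  # returns immediately (EINPROGRESS)
        pending[s.fileno()] = (s, p, time.time())

    results, deadline = {}, time.time() + timeout
    while pending and time.time() < deadline:
        socks = [v[0] for v in pending.values()]
        # A socket becomes writable once its connect attempt has finished.
        _, writable, _ = select.select([], socks, [], 0.5)
        for s in writable:
            sock, p, start = pending.pop(s.fileno())
            ok = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) == 0
            results[p] = (time.time() - start) if ok else None
            sock.close()
    for sock, p, _ in pending.values():  # whatever is left timed out
        results[p] = None
        sock.close()
    return results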
requests uses blocking IO. Here are the docs: http://docs.python-requests.org/en/latest/user/advanced/#blocking-or-non-blocking
By default, you are using blocking IO. Look at the docs for alternatives.
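As one example of the non-blocking route, the page linked above mentions grequests, a gevent-based wrapper around requests. A rough sketch, with the concurrency value purely illustrative:

# pip install grequests
import grequests

def check_proxies_async(proxies, timeout=15):
    reqs = [grequests.get('http://api.ipify.org',
                          proxies={'http': 'http://%s' % p.rstrip()},
                          timeout=timeout)
            for p in proxies]
    # size bounds how many requests are in flight at once.
    responses = grequests.map(reqs, size=64)
    # grequests.map returns None for requests that raised an exception.
    return [(p, r is not None) for p, r in zip(proxies, responses)]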
Upvotes: 1