whiterock

Reputation: 300

Threading slowing down response time - python

I am currently writing a Python program that checks whether proxies respond and measures how long they take. The URL I'm GETting is a public API that can handle millions of requests per second (ipify.org), so that shouldn't be the bottleneck. Testing hundreds or even thousands of proxies is of course slow with a timeout of 15 s (e.g. 100 * 15 s = 25 min), so I introduced threading into my program. The following behaviour occurs:

When I start 256 threads that process a list of 5000 proxies, the roughly 10 % of them that do respond show a steadily increasing response time...

When I start only 16 threads, response times vary, i.e. proxies further down the list sometimes respond faster than ones tested earlier (as it should be)

I am more or less a networking beginner, so my question is: what is the limit of threads/requests per second I can run without distorting the measurements?

The code I use:

def consumer(id):
    while True:
        try:
            # popleft() on a deque is atomic; catching IndexError avoids the
            # race between checking len(q) and popping in another thread
            proxy = q.popleft()
        except IndexError:
            break

        # Give each thread a slightly different, small delay to avoid
        # simultaneous tcp/ip bombing... (maybe ??)
        time.sleep(id * 0.01)

        s_t = time.time()
        state = check_proxy(proxy)
        response_time = time.time() - s_t

        proxy_list.append({
            'proxy_ip': proxy,
            'working': state[0],
            'resp_time': response_time if state[0] else None
        })

threads = []

# 256 Threads
for i in range(256):
    t = Thread(target=consumer, args=(i,))
    t.daemon = True
    t.start()
    threads.append(t)

for thr in threads:
    thr.join()

The check_proxy function:

def check_proxy(proxy, conn_type='http', site='http://api.ipify.org', timeout=15):
    # Format to e.g. { "http": "http://183.207.232.119:8080" }
    proxy_req = {conn_type: "%s://%s" % (conn_type, proxy.rstrip())}

    try:
        r = requests.get(site, proxies=proxy_req, timeout=timeout)
        return True, r
    except requests.exceptions.RequestException as e:
        return False, e

Test Results with 1000 Threads and requests:

[758 rows x 3 columns]
                 proxy_ip working  resp_time
26      212.66.42.98:8080    True   1.417061
60     50.97.212.199:3128    True   2.986519
62      23.88.238.46:8081    True   2.002400
63     183.207.229.202:80    True   2.452403
64     183.207.229.194:80    True   2.283683
65     183.207.229.195:80    True   2.501426
66       60.194.100.51:80    True   2.108991
67    83.222.221.137:8080    True   3.075372
68        37.239.46.26:80    True   2.776244
69       80.94.114.197:80    True   1.707185
71     41.75.201.146:8080    True   3.287514
72     42.202.146.58:8080    True   3.874238
75     222.45.196.19:8118    True   3.375033
76     120.202.249.230:80    True   2.778418
77   222.124.198.136:3129    True   2.638542
78       61.184.192.42:80    True   3.474871
79   101.251.238.123:8080    True   2.216384
80      222.87.129.218:80    True   2.541614
81      113.6.252.139:808    True   4.340471
82      218.240.156.82:80    True   3.737869
83       221.176.14.72:80    True   2.408369
84      58.253.238.242:80    True   4.351352
86    219.239.236.49:8888    True   4.693788
87      222.88.236.236:83    True   5.213140
88        119.6.144.73:82    True   3.002683
..                    ...     ...        ...
256     36.85.88.179:8080    True  10.218517
257       117.21.192.9:80    True  10.322229
258     120.193.146.95:83    True   6.408998
259    91.241.18.129:3129    True   7.596714
260    58.213.19.134:2311    True   6.430531
261    27.131.190.66:8080    True   8.047689
262     222.88.236.236:82    True   8.649196
263       119.6.144.73:83    True   8.205048
265     176.31.138.187:80    True  11.444282
266   195.88.192.144:8080    True   6.716996
267    91.188.39.232:8888    True   7.986101
268    202.95.149.62:8080    True  12.453279
269     121.31.5.188:8080    True   6.956209
271      5.53.16.183:8080    True  10.354440
272    37.187.101.28:3128    True  10.922564
273    60.207.63.124:8118    True   9.908007
274   223.195.87.101:8081    True  13.230916
275   89.251.103.130:8080    True  13.350009
276      121.14.138.56:81    True  12.367794
277    118.244.213.6:3128    True   9.533521
278  218.92.227.170:13669    True  12.410708
280       212.68.51.58:80    True  10.599926
446  190.121.148.229:8080    True  15.064356
450  220.132.214.103:9064    True  17.016748
451  164.138.237.251:8080    True  16.171984
454   222.124.28.188:8080    True  15.233777
455     62.176.13.22:8088    True  17.180487
456      82.146.44.39:443    True  15.448998
755     85.9.209.244:8080    True  26.002548
757    201.86.94.166:8080    True  25.771388

The proxies checked later clearly have much longer response times. I shuffled the queue at the start to verify that the proxies further down my list weren't simply slower; they are not, and the result shown here is reproducible.

Upvotes: 3

Views: 2319

Answers (1)

dnozay

Reputation: 24324

If you only have one single process, then you only get one slice of the CPU. That slice is divided between your 256 threads. That's potentially a lot of context switching.

  • use more processes to get more slices (the multiprocessing module works well for this)
  • use fewer threads
  • your check_proxy implementation may be the bottleneck (is it based on a socket select function or a blocking implementation?)

With that many threads, and assuming you are using a regular desktop machine (most have around 8 cores nowadays), there is a lot of context switching. The requests library hides a lot of the boilerplate code you would otherwise need, but you may not be using connection pooling properly.

more processes for more work

With one process you only get so far. If N processes are competing for the CPU, each gets roughly 1 / N of the CPU time; if 2 of those N processes are yours, you get 2 / N.

You would be better off using the multiprocessing module, which can use more cores. While this won't make the responses themselves arrive any faster, it will expedite the handling of the responses.
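As a minimal sketch of that idea: the function and proxy names below are made up, and the network check is simulated with a short sleep so the example runs offline; in real code the worker would call something like your check_proxy.

```python
import multiprocessing
import time


def check_one(proxy):
    # Placeholder for a real network check such as check_proxy(proxy);
    # the sleep stands in for network latency so this runs offline.
    time.sleep(0.01)
    return {'proxy_ip': proxy, 'working': True, 'resp_time': 0.01}


def check_all(proxies, processes=4):
    # Each worker process gets its own CPU slice, so parsing and
    # bookkeeping of responses no longer compete for a single
    # interpreter's slice the way 256 threads in one process do.
    with multiprocessing.Pool(processes=processes) as pool:
        return pool.map(check_one, proxies)


if __name__ == '__main__':
    results = check_all(['10.0.0.%d:8080' % i for i in range(20)])
    print(len(results))  # 20
```

Pool.map also preserves input order, so results line up with the proxy list without extra bookkeeping.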

low-level implementation

Use select.select() for more efficient I/O handling; this works for sockets too with socket.fileno().
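A tiny illustration of select-based readiness checking, using a local socketpair instead of real proxy connections so it runs without the network:

```python
import select
import socket


def readiness_demo():
    # A connected pair of local sockets stands in for a proxy connection.
    a, b = socket.socketpair()

    # Nothing written yet: a zero-timeout select returns immediately
    # instead of blocking in recv(), so b is reported not readable.
    before, _, _ = select.select([b], [], [], 0)

    a.sendall(b'pong')

    # Data is now buffered on b, so select reports it readable and a
    # subsequent recv() is guaranteed not to block.
    after, _, _ = select.select([b], [], [], 0)
    data = b.recv(4) if after else None

    a.close()
    b.close()
    return len(before), data
```

One loop around select can watch hundreds of sockets at once, which is what replaces one blocked thread per proxy.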

requests uses blocking IO

Here are the docs: http://docs.python-requests.org/en/latest/user/advanced/#blocking-or-non-blocking

By default, each requests call uses blocking IO: it holds its thread until the response arrives or the timeout fires. Look at the docs for non-blocking alternatives.
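As a stdlib-only sketch of the non-blocking approach: a semaphore caps in-flight checks, and asyncio.sleep stands in for a real async HTTP call (which would need a library such as aiohttp); the names here are made up for illustration.

```python
import asyncio
import time


async def check_proxy_async(proxy, sem):
    # The semaphore caps concurrent checks without one OS thread per proxy.
    async with sem:
        start = time.monotonic()
        # Stand-in for a real non-blocking HTTP request; asyncio.sleep
        # simulates network latency without blocking the event loop.
        await asyncio.sleep(0.01)
        return proxy, time.monotonic() - start


async def check_all_async(proxies, limit=100):
    sem = asyncio.Semaphore(limit)
    return await asyncio.gather(
        *(check_proxy_async(p, sem) for p in proxies))


results = asyncio.run(
    check_all_async(['10.0.0.%d:80' % i for i in range(50)]))
```

Because waiting proxies cost almost nothing here, thousands of checks can be in flight in a single thread, with no context-switch overhead distorting the timings.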

Upvotes: 1
