Reputation: 95
I've created a Python script that rotates proxies in order to fetch valid responses from some links. The function get_proxy_list() produces proxies from a source; however, I've hardcoded five proxies within that function for brevity.
There are two more functions, validate_proxies() and fetch_response(). The function validate_proxies() filters the working proxies out of the crude list produced by get_proxy_list(), and fetch_response() then uses those working proxies to fetch responses from the list of URLs I have.
I'm not sure validate_proxies() is of any use at all, because I could use the crude proxies directly within fetch_response(). Moreover, most free proxies are short-lived, so by the time I've finished filtering the crude list, the working proxies may already be dead. Either way, the script runs very slowly even when it finds and uses working proxies.
I've tried with:
import random
import requests
from bs4 import BeautifulSoup  # only needed when get_proxy_list() scrapes a live source

validation_link = 'https://icanhazip.com/'

target_links = [
    'https://stackoverflow.com/questions/tagged/web-scraping',
    'https://stackoverflow.com/questions/tagged/vba',
    'https://stackoverflow.com/questions/tagged/java'
]

working_proxies = []

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

def get_proxy_list():
    # Hardcoded for brevity; normally these would come from a proxy source.
    proxy_list = ['198.24.171.26:8001','187.130.139.197:8080','159.197.128.8:3128','119.28.56.116:808','85.15.152.39:3128']
    return proxy_list

def validate_proxies(proxies,link):
    # Try each proxy against the validation link and keep the ones that respond with 200.
    proxy_url = proxies.pop(random.randrange(len(proxies)))
    while True:
        proxy = {'https': f'http://{proxy_url}'}
        try:
            res = requests.get(link,proxies=proxy,headers=headers,timeout=5)
            assert res.status_code==200
            working_proxies.append(proxy_url)
            if not proxies: break
            proxy_url = proxies.pop(random.randrange(len(proxies)))
        except Exception as e:
            print("error raised as:",str(e))
            if not proxies: break
            proxy_url = proxies.pop(random.randrange(len(proxies)))
    return working_proxies

def fetch_response(proxies,url):
    # Rotate through the working proxies until one returns a 200 response.
    proxy_url = proxies.pop(random.randrange(len(proxies)))
    while True:
        proxy = {'https': f'http://{proxy_url}'}
        try:
            resp = requests.get(url, proxies=proxy, headers=headers, timeout=7)
            assert resp.status_code==200
            return resp
        except Exception as e:
            print("error thrown as:",str(e))
            if not proxies: return
            proxy_url = proxies.pop(random.randrange(len(proxies)))

if __name__ == '__main__':
    proxies = get_proxy_list()
    working_proxy_list = validate_proxies(proxies,validation_link)
    print("working proxy list:",working_proxy_list)
    for target_link in target_links:
        print(fetch_response(working_proxy_list,target_link))
Question: what is the right way to rotate proxies within a script in order to make the execution faster?
Upvotes: 1
Views: 929
Reputation: 7040
I've made a few changes to your code that will hopefully help you: the proxies are now validated concurrently using concurrent.futures.ThreadPoolExecutor. This means that instead of waiting up to 5 seconds for each proxy check to time out, you will wait at most 5 seconds for all of them to time out.

import itertools as it
from concurrent.futures import ThreadPoolExecutor, TimeoutError
from typing import Dict
from bs4 import BeautifulSoup
import requests
Proxy = Dict[str, str]
executor = ThreadPoolExecutor()
validation_link = 'https://icanhazip.com/'
target_links = [
    'https://stackoverflow.com/questions/tagged/web-scraping',
    'https://stackoverflow.com/questions/tagged/vba',
    'https://stackoverflow.com/questions/tagged/java'
]

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

def get_proxy_list():
    # Scrape free HTTPS-capable elite proxies from sslproxies.org.
    response = requests.get('https://www.sslproxies.org/')
    soup = BeautifulSoup(response.text, "html.parser")
    proxies = [
        ':'.join([item.select_one('td').text, item.select_one('td:nth-of-type(2)').text])
        for item in soup.select('table.table tr')
        if ('yes' in item.text and 'elite proxy' in item.text)
    ]
    return [{'https': f'http://{x}'} for x in proxies]

def validate_proxy(proxy: Proxy) -> Proxy:
    # Raises if the proxy cannot fetch the validation link with a 200 status.
    res = requests.get(validation_link, proxies=proxy, headers=headers, timeout=5)
    assert 200 == res.status_code
    return proxy

def get_working_proxy() -> Proxy:
    # Validate all proxies concurrently and return the first one that succeeds.
    futures = [executor.submit(validate_proxy, x) for x in get_proxy_list()]
    for i in it.count():
        future = futures[i % len(futures)]
        try:
            working_proxy = future.result(timeout=0.01)
            for f in futures:
                f.cancel()
            return working_proxy
        except TimeoutError:
            continue
        except Exception:
            futures.remove(future)
            if not len(futures):
                raise Exception('No working proxies found') from None

def fetch_response(url: str) -> requests.Response:
    res = requests.get(url, proxies=get_working_proxy(), headers=headers, timeout=7)
    assert res.status_code == 200
    return res
Usage:
>>> get_working_proxy()
{'https': 'http://119.81.189.194:80'}
>>> get_working_proxy()
{'https': 'http://198.50.163.192:3129'}
>>> get_working_proxy()
{'https': 'http://191.241.145.22:6666'}
>>> get_working_proxy()
{'https': 'http://169.57.1.84:8123'}
>>> get_working_proxy()
{'https': 'http://182.253.171.31:8080'}
In each case, one of the proxies with the lowest latency is returned.
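For completeness, one way the question's original main block could be wired up to these functions is sketched below; this is not part of the answer's code, and the executor.shutdown call is just optional cleanup:
if __name__ == '__main__':
    for target_link in target_links:
        print(fetch_response(target_link))
    # Free the ThreadPoolExecutor's worker threads once all links are fetched (optional).
    executor.shutdown(wait=False)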
If you want to make the code even more efficient, and you can be almost certain that a working proxy will still be working for some short amount of time (e.g. 30 seconds), then you can improve this further by putting the proxies into a TTL cache and repopulating it as necessary, rather than finding a working proxy every time you call fetch_response. See https://stackoverflow.com/a/52128389/5946921 for how to implement a TTL cache in Python.
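As a rough illustration of that idea (not taken from the linked answer), here is a minimal sketch that caches the validated proxy for a fixed 30-second TTL; the names PROXY_TTL, cached_working_proxy and fetch_response_cached are hypothetical additions on top of the functions above:
import time

PROXY_TTL = 30           # seconds a validated proxy is trusted for (assumption)
_cached_proxy = None     # last proxy returned by get_working_proxy()
_cached_at = 0.0         # timestamp of when it was cached

def cached_working_proxy() -> Proxy:
    # Reuse the cached proxy while it is fresh; otherwise revalidate.
    global _cached_proxy, _cached_at
    if _cached_proxy is None or time.monotonic() - _cached_at > PROXY_TTL:
        _cached_proxy = get_working_proxy()   # expensive: re-runs all validations
        _cached_at = time.monotonic()
    return _cached_proxy

def fetch_response_cached(url: str) -> requests.Response:
    # Same as fetch_response, but reuses one proxy for up to PROXY_TTL seconds.
    res = requests.get(url, proxies=cached_working_proxy(), headers=headers, timeout=7)
    assert res.status_code == 200
    return res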
Upvotes: 2