SMTH

Reputation: 95

Can't figure out the right way to rotate proxies within a script to speed up execution

I've created a Python script that rotates proxies to fetch correct responses from some links. The function get_proxy_list() produces proxies from a source; however, I've hardcoded 5 proxies within that function for brevity.

There are two more functions, validate_proxies() and fetch_response(). validate_proxies() filters the working proxies out of the crude list generated by get_proxy_list().

Finally, fetch_response() uses those working proxies to fetch correct responses from my list of URLs.

I'm not sure validate_proxies() is of any use at all, because I could use the crude proxies directly within fetch_response(). Moreover, most free proxies are short-lived, so by the time I've finished filtering the crude list, the working proxies may already be dead. Either way, the script runs very slowly even when it finds and uses working proxies.

I've tried with:

import random
import requests
from bs4 import BeautifulSoup

validation_link = 'https://icanhazip.com/'

target_links = [
    'https://stackoverflow.com/questions/tagged/web-scraping',
    'https://stackoverflow.com/questions/tagged/vba',
    'https://stackoverflow.com/questions/tagged/java'
]

working_proxies = []

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

def get_proxy_list():
    # Hardcoded for brevity; in practice these come from a proxy source
    proxy_list = ['198.24.171.26:8001','187.130.139.197:8080','159.197.128.8:3128','119.28.56.116:808','85.15.152.39:3128']
    return proxy_list


def validate_proxies(proxies,link):
    # Try every crude proxy against the validation link and keep the ones that respond with 200
    proxy_url = proxies.pop(random.randrange(len(proxies)))
    while True:
        proxy = {'https': f'http://{proxy_url}'}
        try:
            res = requests.get(link,proxies=proxy,headers=headers,timeout=5)
            assert res.status_code==200
            working_proxies.append(proxy_url)
            if not proxies: break
            proxy_url = proxies.pop(random.randrange(len(proxies)))
        except Exception as e:
            print("error raised as:",str(e))
            if not proxies: break
            proxy_url = proxies.pop(random.randrange(len(proxies)))

    return working_proxies


def fetch_response(proxies,url):
    # Keep picking proxies at random until one returns a 200 response for the url
    proxy_url = proxies.pop(random.randrange(len(proxies)))

    while True:
        proxy = {'https': f'http://{proxy_url}'}
        try:
            resp = requests.get(url, proxies=proxy, headers=headers, timeout=7)
            assert resp.status_code==200
            return resp
        except Exception as e:
            print("error thrown as:",str(e))
            if not proxies: return 
            proxy_url = proxies.pop(random.randrange(len(proxies)))


if __name__ == '__main__':
    proxies = get_proxy_list()
    working_proxy_list = validate_proxies(proxies,validation_link)

    print("working proxy list:",working_proxy_list)

    for target_link in target_links:
        print(fetch_response(working_proxy_list,target_link))

Question: what is the right way to rotate proxies within a script in order to make the execution faster?

Upvotes: 1

Views: 929

Answers (1)

Will Da Silva

Reputation: 7040

I've made a few changes to your code that will hopefully help you:

  • Since you mentioned that the proxies are short-lived, the code now fetches new proxies and checks if they work on every request.
  • Checking whether proxies work is now done in parallel using a concurrent.futures.ThreadPoolExecutor. This means that instead of waiting up to 5 seconds for each proxy check to time out, you will wait at most 5 seconds for all of them to time out.
  • Instead of randomly choosing a proxy, the first proxy that is found to be working is used.
  • Type hints have been added.

import itertools as it
from concurrent.futures import ThreadPoolExecutor, TimeoutError
from typing import Dict

from bs4 import BeautifulSoup
import requests


Proxy = Dict[str, str]

executor = ThreadPoolExecutor()

validation_link = 'https://icanhazip.com/'

target_links = [
    'https://stackoverflow.com/questions/tagged/web-scraping',
    'https://stackoverflow.com/questions/tagged/vba',
    'https://stackoverflow.com/questions/tagged/java'
]

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}


def get_proxy_list():
    # Scrape free elite proxies (HTTPS column marked 'yes') from sslproxies.org
    response = requests.get('https://www.sslproxies.org/')
    soup = BeautifulSoup(response.text, "html.parser")
    proxies = [
        ':'.join([item.select_one('td').text, item.select_one('td:nth-of-type(2)').text])
        for item in soup.select('table.table tr')
        if 'yes' in item.text and 'elite proxy' in item.text
    ]
    return [{'https': f'http://{x}'} for x in proxies]


def validate_proxy(proxy: Proxy) -> Proxy:
    # A proxy counts as working if it fetches the validation link with a 200 within 5 seconds
    res = requests.get(validation_link, proxies=proxy, headers=headers, timeout=5)
    assert 200 == res.status_code
    return proxy


def get_working_proxy() -> Proxy:
    # Validate all proxies concurrently and return the first one that passes
    futures = [executor.submit(validate_proxy, x) for x in get_proxy_list()]
    for i in it.count():
        future = futures[i % len(futures)]  # poll the pending checks round-robin
        try:
            working_proxy = future.result(timeout=0.01)
            for f in futures:
                f.cancel()  # a working proxy was found; cancel checks that haven't started
            return working_proxy
        except TimeoutError:
            continue  # this check hasn't finished yet; try the next one
        except Exception:
            futures.remove(future)  # this proxy failed validation
            if not len(futures):
                raise Exception('No working proxies found') from None


def fetch_response(url: str) -> requests.Response:
    # Fetch the url through a freshly validated proxy
    res = requests.get(url, proxies=get_working_proxy(), headers=headers, timeout=7)
    assert res.status_code == 200
    return res

Usage:

>>> get_working_proxy()
{'https': 'http://119.81.189.194:80'}
>>> get_working_proxy()
{'https': 'http://198.50.163.192:3129'}
>>> get_working_proxy()
{'https': 'http://191.241.145.22:6666'}
>>> get_working_proxy()
{'https': 'http://169.57.1.84:8123'}
>>> get_working_proxy()
{'https': 'http://182.253.171.31:8080'}

In each case, one of the proxies with the lowest latency is returned, since whichever validation request finishes first wins. An equivalent way to express that first-success behaviour with concurrent.futures.as_completed is sketched below.
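
This is only a sketch, reusing the executor, Proxy alias, validate_proxy and get_proxy_list defined above; get_working_proxy_alt is an illustrative name, not part of the answer's code. as_completed yields futures in the order they finish, so the first proxy whose check succeeds is returned:

from concurrent.futures import as_completed

def get_working_proxy_alt() -> Proxy:
    # Submit all validations at once and take whichever succeeds first
    futures = [executor.submit(validate_proxy, p) for p in get_proxy_list()]
    for future in as_completed(futures):
        try:
            working_proxy = future.result()  # re-raises if validate_proxy failed
        except Exception:
            continue  # this proxy failed; wait for the next check to finish
        for f in futures:
            f.cancel()  # best effort: cancel checks that haven't started yet
        return working_proxy
    raise Exception('No working proxies found')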

If you want to make the code even more efficient, and you can be almost certain that a working proxy will still be working in some short amount of time (e.g. 30 seconds), then you can upgrade this by putting the proxies into a TTL cache, and repopulating it as necessary, rather than finding a working proxy every time you call fetch_response. See https://stackoverflow.com/a/52128389/5946921 for how to implement a TTL cache in Python.
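
A minimal sketch of that TTL idea, assuming a working proxy stays usable for roughly 30 seconds; PROXY_TTL, get_cached_proxy and fetch_response_cached are illustrative names, not part of the code above:

import time

PROXY_TTL = 30  # seconds; an assumption, tune to how long your proxies actually survive

_cached_proxy = None
_cached_at = 0.0

def get_cached_proxy() -> Proxy:
    # Reuse the last working proxy until it is older than PROXY_TTL, then find a new one
    global _cached_proxy, _cached_at
    if _cached_proxy is None or time.monotonic() - _cached_at > PROXY_TTL:
        _cached_proxy = get_working_proxy()
        _cached_at = time.monotonic()
    return _cached_proxy

def fetch_response_cached(url: str) -> requests.Response:
    # Same as fetch_response, but reuses the cached proxy instead of validating on every call
    res = requests.get(url, proxies=get_cached_proxy(), headers=headers, timeout=7)
    assert res.status_code == 200
    return res

If the cached proxy dies before its TTL expires, fetch_response_cached will raise; you could catch that, clear the cache, and retry to force a refresh.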

Upvotes: 2
