Reputation: 534
I'm trying to do some scraping, but I get blocked every 4 requests. I have tried changing proxies, but the error is the same. What should I do to rotate them properly?
Here is the code where I try it. First I get proxies from a free proxy site. Then I make the request with the new proxy, but it doesn't work because I still get blocked.
from fake_useragent import UserAgent
from bs4 import BeautifulSoup
import requests

def get_player(id, proxy):
    ua = UserAgent()
    headers = {'User-Agent': ua.random}
    url = 'https://www.transfermarkt.es/jadon-sancho/profil/spieler/' + str(id)
    try:
        print(proxy)
        r = requests.get(url, headers=headers, proxies=proxy)
    except:
        ....
    code to manage the data
    ....
def get_proxies():
    ua = UserAgent()
    headers = {'User-Agent': ua.random}
    url = 'https://free-proxy-list.net/'
    r = requests.get(url, headers=headers)
    page = BeautifulSoup(r.text, 'html.parser')
    proxies = []
    for proxy in page.find_all('tr'):
        i = ip = port = 0
        for data in proxy.find_all('td'):
            if i == 0:
                ip = data.get_text()
            if i == 1:
                port = data.get_text()
            i += 1
        if ip != 0 and port != 0:
            proxies += [{'http': 'http://' + ip + ':' + port}]
    return proxies
proxies = get_proxies()
for i in range(1, 100):
    player = get_player(i, proxies[i//4])
    ....
    code to manage the data
    ....
I know the proxy scraping works, because when I print a proxy I see something like: {'http': 'http://88.12.48.61:42365'}. I would like to avoid getting blocked.
Upvotes: 21
Views: 49471
Reputation: 839
I recently had this same issue, but using online proxy servers as recommended in other answers is always risky (from a privacy standpoint), slow, or unreliable.
Instead, you can use my requests-ip-rotator Python library to proxy traffic through AWS API Gateway, which gives you a new IP each time: pip install requests-ip-rotator
This can be used as follows (for your site specifically):
import requests
from requests_ip_rotator import ApiGateway
gateway = ApiGateway("https://www.transfermarkt.es")
gateway.start()
session = requests.Session()
session.mount("https://www.transfermarkt.es", gateway)
response = session.get("https://www.transfermarkt.es/jadon-sancho/profil/spieler/your_id")
print(response.status_code)
# Only run this line if you are no longer going to run the script, as it takes longer to boot up again next time.
gateway.shutdown()
Combined with multithreading/multiprocessing, you'll be able to scrape the site in no time.
The AWS free tier provides you with 1 million requests per region, so this option will be free for all reasonable scraping.
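For example, a minimal threaded sketch of the multithreading idea above (the player-ID range and worker count here are arbitrary assumptions, not part of the library):
import requests
from concurrent.futures import ThreadPoolExecutor
from requests_ip_rotator import ApiGateway

site = "https://www.transfermarkt.es"
gateway = ApiGateway(site)
gateway.start()

session = requests.Session()
session.mount(site, gateway)

def fetch_player(player_id):
    # each request exits through a different API Gateway IP
    url = f"{site}/jadon-sancho/profil/spieler/{player_id}"
    return session.get(url).status_code

# hypothetical ID range; adjust to the players you actually need
with ThreadPoolExecutor(max_workers=8) as pool:
    print(list(pool.map(fetch_player, range(1, 20))))

gateway.shutdown()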
Upvotes: 37
Reputation: 21436
Presumably you have your own pool of proxies - what is the best way to rotate them?
First, if we blindly pick a random proxy, we risk repeating a connection from the same proxy multiple times in a row. On top of that, most connection-pattern-based blocking keys on the proxy's subnet (the 3rd number of the IP) rather than the individual host, so it's best to prevent repeats at the subnet level.
It's also a good idea to track proxy performance, as not all proxies are equal - we want to use our better-performing proxies more often and let dead proxies cool down.
All of this can be done with weighted randomization, which is implemented by Python's random.choices() function:
import random
from time import time
from typing import List, Literal
class Proxy:
    """container for a proxy"""

    def __init__(self, ip, type_="datacenter") -> None:
        self.ip: str = ip
        self.type: Literal["datacenter", "residential"] = type_
        _, _, self.subnet, self.host = ip.split(":")[0].split(".")
        self.status: Literal["alive", "unchecked", "dead"] = "unchecked"
        self.last_used: float = None

    def __repr__(self) -> str:
        return self.ip

    def __str__(self) -> str:
        return self.ip


class Rotator:
    """weighted random proxy rotator"""

    def __init__(self, proxies: List[Proxy]):
        self.proxies = proxies
        self._last_subnet = None

    def weigh_proxy(self, proxy: Proxy):
        weight = 1_000
        if proxy.subnet == self._last_subnet:
            weight -= 500
        if proxy.status == "dead":
            weight -= 500
        if proxy.status == "unchecked":
            weight += 250
        if proxy.type == "residential":
            weight += 250
        if proxy.last_used:
            _seconds_since_last_use = time() - proxy.last_used
            weight += _seconds_since_last_use
        return weight

    def get(self):
        proxy_weights = [self.weigh_proxy(p) for p in self.proxies]
        proxy = random.choices(
            self.proxies,
            weights=proxy_weights,
            k=1,
        )[0]
        proxy.last_used = time()
        self._last_subnet = proxy.subnet
        return proxy
If we do a mock run of this Rotator, we can see how the weighted randomization distributes our connections:
from collections import Counter

if __name__ == "__main__":
    proxies = [
        # these will be used more often
        Proxy("xx.xx.121.1", "residential"),
        Proxy("xx.xx.121.2", "residential"),
        Proxy("xx.xx.121.3", "residential"),
        # these will be used less often
        Proxy("xx.xx.122.1"),
        Proxy("xx.xx.122.2"),
        Proxy("xx.xx.123.1"),
        Proxy("xx.xx.123.2"),
    ]
    rotator = Rotator(proxies)

    # let's mock some runs:
    _used = Counter()
    _failed = Counter()

    def mock_scrape():
        proxy = rotator.get()
        _used[proxy.ip] += 1
        if proxy.host == "1":  # simulate proxies with .1 being significantly worse
            _fail_rate = 60
        else:
            _fail_rate = 20
        if random.randint(0, 100) < _fail_rate:  # simulate some failure
            _failed[proxy.ip] += 1
            proxy.status = "dead"
            mock_scrape()
        else:
            proxy.status = "alive"
        return

    for i in range(10_000):
        mock_scrape()

    for proxy, count in _used.most_common():
        print(f"{proxy} was used {count:>5} times")
        print(f" failed {_failed[proxy]:>5} times")
# will print:
# xx.xx.121.2 was used 2629 times
# failed 522 times
# xx.xx.121.3 was used 2603 times
# failed 508 times
# xx.xx.123.2 was used 2321 times
# failed 471 times
# xx.xx.122.2 was used 2302 times
# failed 433 times
# xx.xx.121.1 was used 1941 times
# failed 1187 times
# xx.xx.122.1 was used 1629 times
# failed 937 times
# xx.xx.123.1 was used 1572 times
# failed 939 times
By using weighted randoms we can create a connection pattern that appears random but is smart. We can apply generic patterns, like not using proxies from the same IP subnet twice in a row, as well as custom per-target logic, like prioritizing North American IPs for NA targets, etc.
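As a sketch of that per-target idea, here is a hypothetical extension of the Rotator above; the region attribute is an assumption and would need to be added to Proxy:
class RegionalRotator(Rotator):
    """hypothetical: favor proxies located near the scrape target"""

    def __init__(self, proxies: List[Proxy], preferred_region: str = None):
        super().__init__(proxies)
        self.preferred_region = preferred_region  # e.g. "NA" for North American targets

    def weigh_proxy(self, proxy: Proxy):
        weight = super().weigh_proxy(proxy)
        # `region` is an assumed attribute, not part of the Proxy class above
        if self.preferred_region and getattr(proxy, "region", None) == self.preferred_region:
            weight += 250
        return weight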
For more on this, see my blog How to Rotate Proxies in Web Scraping.
Upvotes: 4
Reputation: 111
import requests
from itertools import cycle

# SOCKS proxy URLs require PySocks: pip install requests[socks]
list_proxy = [
    'socks5://Username:Password@IP1:20000',
    'socks5://Username:Password@IP2:20000',
    'socks5://Username:Password@IP3:20000',
    'socks5://Username:Password@IP4:20000',
]

proxy_cycle = cycle(list_proxy)
# Prime the pump
proxy = next(proxy_cycle)

for i in range(1, 10):
    proxy = next(proxy_cycle)
    print(proxy)
    proxies = {
        "http": proxy,
        "https": proxy,
    }
    r = requests.get(url='https://ident.me/', proxies=proxies)
    print(r.text)
Upvotes: 11
Reputation: 5955
The problem with using free proxies from sites like this is:
- websites know about these lists and may block you just because you're using one of them
- you don't know that other people haven't already gotten them blacklisted by doing bad things with them
- the site is likely using some other identifier to track you across proxies based on other characteristics (device fingerprinting, proxy-piercing, etc.)
Unfortunately, there's not a lot you can do other than be more sophisticated (distribute across multiple devices, use a VPN/Tor, etc.) and risk your IP being blocked for DDoS-like traffic, or, preferably, see if the site has an API for access.
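For the Tor option, a minimal sketch, assuming a local Tor daemon is running on its default SOCKS port 9050 and requests[socks] is installed:
import requests

# assumes Tor is running locally on its default SOCKS5 port 9050;
# the socks5h scheme resolves DNS through Tor as well
tor_proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

r = requests.get("https://ident.me/", proxies=tor_proxies)
print(r.text)  # prints the Tor exit node's IP rather than yours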
Upvotes: 7