Reputation: 534
I'm trying to do some scraping, but I get blocked every 4 requests. I have tried changing proxies, but the error is the same. What should I do to rotate them properly?
Here is the code where I try it. First I get proxies from a free proxy site. Then I make the request with the new proxy, but it doesn't work because I still get blocked.
from fake_useragent import UserAgent
from bs4 import BeautifulSoup
import requests

def get_player(id, proxy):
    ua = UserAgent()
    headers = {'User-Agent': ua.random}
    url = 'https://www.transfermarkt.es/jadon-sancho/profil/spieler/' + str(id)
    try:
        print(proxy)
        r = requests.get(url, headers=headers, proxies=proxy)
    except:
        ....
    code to manage the data
    ....
def get_proxies():
    ua = UserAgent()
    headers = {'User-Agent': ua.random}
    url = 'https://free-proxy-list.net/'
    r = requests.get(url, headers=headers)
    page = BeautifulSoup(r.text, 'html.parser')
    proxies = []
    for proxy in page.find_all('tr'):
        i = ip = port = 0
        for data in proxy.find_all('td'):
            if i == 0:
                ip = data.get_text()
            if i == 1:
                port = data.get_text()
            i += 1
        if ip != 0 and port != 0:
            proxies += [{'http': 'http://' + ip + ':' + port}]
    return proxies
proxies = get_proxies()
for i in range(1, 100):
    player = get_player(i, proxies[i//4])
    ....
    code to manage the data
    ....
I know the proxy scraping works, because when I print a proxy I see something like: {'http': 'http://88.12.48.61:42365'}. I would like to avoid getting blocked.
Upvotes: 21
Views: 49471
Reputation: 839
I recently had this same issue, but using online proxy servers as recommended in other answers is always risky (from a privacy standpoint), slow, or unreliable.
Instead, you can use my requests-ip-rotator Python library to proxy traffic through AWS API Gateway, which gives you a new IP each time: pip install requests-ip-rotator
This can be used as follows (for your site specifically):
import requests
from requests_ip_rotator import ApiGateway
gateway = ApiGateway("https://www.transfermarkt.es")
gateway.start()
session = requests.Session()
session.mount("https://www.transfermarkt.es", gateway)
response = session.get("https://www.transfermarkt.es/jadon-sancho/profil/spieler/your_id")
print(response.status_code)
# Only run this line if you are no longer going to run the script, as it takes longer to boot up again next time.
gateway.shutdown()
Combined with multithreading/multiprocessing, you'll be able to scrape the site in no time.
The AWS free tier provides you with 1 million requests per region, so this option will be free for all reasonable scraping.
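For example, a minimal threaded sketch of the multithreading idea above (the player-ID range and worker count here are arbitrary assumptions, not part of the library):
import requests
from concurrent.futures import ThreadPoolExecutor
from requests_ip_rotator import ApiGateway

site = "https://www.transfermarkt.es"
gateway = ApiGateway(site)
gateway.start()

session = requests.Session()
session.mount(site, gateway)

def fetch_player(player_id):
    # each request exits through a different API Gateway IP
    url = f"{site}/jadon-sancho/profil/spieler/{player_id}"
    return session.get(url).status_code

# hypothetical ID range; adjust to the players you actually need
with ThreadPoolExecutor(max_workers=8) as pool:
    print(list(pool.map(fetch_player, range(1, 20))))

gateway.shutdown()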
Upvotes: 37
Reputation: 21436
Presumably you have your own pool of proxies - what is the best way to rotate them?
First, if we blindly pick a random proxy, we risk repeating a connection from the same proxy multiple times in a row. On top of that, most connection-pattern-based blocking keys on the proxy's subnet (the 3rd number of the IP) rather than the individual host, so it's best to prevent repeats at the subnet level.
It's also a good idea to track proxy performance, as not all proxies are equal - we want to use our better-performing proxies more often and let dead proxies cool down.
All of this can be done with weighted randomization, which is implemented by Python's random.choices() function:
import random
from time import time
from typing import List, Literal
class Proxy:
    """container for a proxy"""

    def __init__(self, ip, type_="datacenter") -> None:
        self.ip: str = ip
        self.type: Literal["datacenter", "residential"] = type_
        _, _, self.subnet, self.host = ip.split(":")[0].split(".")
        self.status: Literal["alive", "unchecked", "dead"] = "unchecked"
        self.last_used: float = None

    def __repr__(self) -> str:
        return self.ip

    def __str__(self) -> str:
        return self.ip


class Rotator:
    """weighted random proxy rotator"""

    def __init__(self, proxies: List[Proxy]):
        self.proxies = proxies
        self._last_subnet = None

    def weigh_proxy(self, proxy: Proxy):
        weight = 1_000
        if proxy.subnet == self._last_subnet:
            weight -= 500
        if proxy.status == "dead":
            weight -= 500
        if proxy.status == "unchecked":
            weight += 250
        if proxy.type == "residential":
            weight += 250
        if proxy.last_used:
            _seconds_since_last_use = time() - proxy.last_used
            weight += _seconds_since_last_use
        return weight

    def get(self):
        proxy_weights = [self.weigh_proxy(p) for p in self.proxies]
        proxy = random.choices(
            self.proxies,
            weights=proxy_weights,
            k=1,
        )[0]
        proxy.last_used = time()
        self._last_subnet = proxy.subnet
        return proxy
If we do a mock run of this Rotator, we can see how the weighted randomization distributes our connections:
from collections import Counter

if __name__ == "__main__":
    proxies = [
        # these will be used more often
        Proxy("xx.xx.121.1", "residential"),
        Proxy("xx.xx.121.2", "residential"),
        Proxy("xx.xx.121.3", "residential"),
        # these will be used less often
        Proxy("xx.xx.122.1"),
        Proxy("xx.xx.122.2"),
        Proxy("xx.xx.123.1"),
        Proxy("xx.xx.123.2"),
    ]
    rotator = Rotator(proxies)

    # let's mock some runs:
    _used = Counter()
    _failed = Counter()

    def mock_scrape():
        proxy = rotator.get()
        _used[proxy.ip] += 1
        if proxy.host == "1":  # simulate proxies with .1 being significantly worse
            _fail_rate = 60
        else:
            _fail_rate = 20
        if random.randint(0, 100) < _fail_rate:  # simulate some failure
            _failed[proxy.ip] += 1
            proxy.status = "dead"
            mock_scrape()
        else:
            proxy.status = "alive"
        return

    for i in range(10_000):
        mock_scrape()

    for proxy, count in _used.most_common():
        print(f"{proxy} was used {count:>5} times")
        print(f" failed {_failed[proxy]:>5} times")
# will print:
# xx.xx.121.2 was used 2629 times
# failed 522 times
# xx.xx.121.3 was used 2603 times
# failed 508 times
# xx.xx.123.2 was used 2321 times
# failed 471 times
# xx.xx.122.2 was used 2302 times
# failed 433 times
# xx.xx.121.1 was used 1941 times
# failed 1187 times
# xx.xx.122.1 was used 1629 times
# failed 937 times
# xx.xx.123.1 was used 1572 times
# failed 939 times
By using weighted randoms we can create a connection pattern that appears random but is smart. We can apply generic patterns, like not using proxies from the same IP subnet twice in a row, as well as custom per-target logic, like prioritizing North American IPs for NA targets, etc.
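As a sketch of that per-target idea, here is a hypothetical extension of the Rotator above; the region attribute is an assumption and would need to be added to Proxy:
class RegionalRotator(Rotator):
    """hypothetical: favor proxies located near the scrape target"""

    def __init__(self, proxies: List[Proxy], preferred_region: str = None):
        super().__init__(proxies)
        self.preferred_region = preferred_region  # e.g. "NA" for North American targets

    def weigh_proxy(self, proxy: Proxy):
        weight = super().weigh_proxy(proxy)
        # `region` is an assumed attribute, not part of the Proxy class above
        if self.preferred_region and getattr(proxy, "region", None) == self.preferred_region:
            weight += 250
        return weight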
For more on this, see my blog How to Rotate Proxies in Web Scraping.
Upvotes: 4
Reputation: 111
import requests
from itertools import cycle

# SOCKS proxy URLs require PySocks: pip install requests[socks]
list_proxy = [
    'socks5://Username:Password@IP1:20000',
    'socks5://Username:Password@IP2:20000',
    'socks5://Username:Password@IP3:20000',
    'socks5://Username:Password@IP4:20000',
]

proxy_cycle = cycle(list_proxy)
# Prime the pump
proxy = next(proxy_cycle)

for i in range(1, 10):
    proxy = next(proxy_cycle)
    print(proxy)
    proxies = {
        "http": proxy,
        "https": proxy,
    }
    r = requests.get(url='https://ident.me/', proxies=proxies)
    print(r.text)
Upvotes: 11
Reputation: 5955
The problem with using free proxies from sites like this is:
- websites know about these lists and may block you just because you're using one of them
- you don't know that other people haven't already gotten them blacklisted by doing bad things with them
- the site is likely using some other identifier to track you across proxies based on other characteristics (device fingerprinting, proxy-piercing, etc.)
Unfortunately, there's not a lot you can do other than be more sophisticated (distribute across multiple devices, use a VPN/Tor, etc.) and risk your IP being blocked for DDoS-like traffic, or, preferably, see if the site has an API for access.
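For the Tor option, a minimal sketch, assuming a local Tor daemon is running on its default SOCKS port 9050 and requests[socks] is installed:
import requests

# assumes Tor is running locally on its default SOCKS5 port 9050;
# the socks5h scheme resolves DNS through Tor as well
tor_proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

r = requests.get("https://ident.me/", proxies=tor_proxies)
print(r.text)  # prints the Tor exit node's IP rather than yours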
Upvotes: 7