Reputation: 534
I'm trying to scrape Transfermarkt, a football website, but every time I try I get blocked at the 7th request.
I have tried changing headers and proxies, but I always get the same result.
These are some "experiments" I did. The proxies work when used individually.
import random
import requests

user_agent_list = [...]  # here are a lot of user agents
headers = {'User-Agent': random.choice(user_agent_list)}
url = 'https://www.transfermarkt.es/jadon-sancho/profil/spieler/14'

r = requests.get(url, headers={'User-Agent': random.choice(user_agent_list)}, proxies={'http': 'http://121.121.117.227:3128'})
print(r)
r = requests.get(url, headers={'User-Agent': random.choice(user_agent_list)}, proxies={'http': 'http://121.121.117.227:3128'})
print(r)
r = requests.get(url, headers={'User-Agent': random.choice(user_agent_list)}, proxies={'http': 'http://121.121.117.227:3128'})
print(r)
# Changing proxy
r = requests.get(url, headers={'User-Agent': random.choice(user_agent_list)}, proxies={'http': 'http://177.131.22.186:80'})
print(r)
r = requests.get(url, headers={'User-Agent': random.choice(user_agent_list)}, proxies={'http': 'http://177.131.22.186:80'})
print(r)
r = requests.get(url, headers={'User-Agent': random.choice(user_agent_list)}, proxies={'http': 'http://177.131.22.186:80'})
print(r)
# Here I get blocked
r = requests.get(url, headers={'User-Agent': random.choice(user_agent_list)}, proxies={'http': 'http://177.131.22.186:80'})
print(r)
# ...and I continued trying with other examples
I should point out that the proxies are validated, so they do work individually. The prints show normal responses until I get blocked, at which point I get an error response instead. How should I solve this? Should I change another parameter of the get?
Upvotes: 0
Views: 709
Reputation: 12036
The main problem with your script is that you are trying to connect to an https server through an http-only proxy. You need to set a proxy for https as well:

proxies={'https': 'https://x.y.z.a:b'}

In your case you are only setting an http proxy, so https requests are not going through it.

Please note that the proxy servers you gave in your example don't support https.
Upvotes: 2