Reputation: 534
I'm trying to scrape Transfermarkt, a football website, but every time I try I get blocked at the 7th request.
I have tried changing headers and proxies, but I always get the same result.
These are some "experiments" I did. The proxies work when used individually.
import random
import requests

user_agent_list = [...]  # here are a lot of user agents
headers = {'User-Agent': random.choice(user_agent_list)}
url = 'https://www.transfermarkt.es/jadon-sancho/profil/spieler/14'

r = requests.get(url, headers={'User-Agent': random.choice(user_agent_list)}, proxies={'http': 'http://121.121.117.227:3128'})
print(r)
r = requests.get(url, headers={'User-Agent': random.choice(user_agent_list)}, proxies={'http': 'http://121.121.117.227:3128'})
print(r)
r = requests.get(url, headers={'User-Agent': random.choice(user_agent_list)}, proxies={'http': 'http://121.121.117.227:3128'})
print(r)
# Changing proxy
r = requests.get(url, headers={'User-Agent': random.choice(user_agent_list)}, proxies={'http': 'http://177.131.22.186:80'})
print(r)
r = requests.get(url, headers={'User-Agent': random.choice(user_agent_list)}, proxies={'http': 'http://177.131.22.186:80'})
print(r)
r = requests.get(url, headers={'User-Agent': random.choice(user_agent_list)}, proxies={'http': 'http://177.131.22.186:80'})
print(r)
# Here I get blocked
r = requests.get(url, headers={'User-Agent': random.choice(user_agent_list)}, proxies={'http': 'http://177.131.22.186:80'})
print(r)
# ...and I continued trying with other examples
I should point out that the proxies are validated, so they do work individually. The prints show normal responses until I get blocked, at which point I get an error response instead. How should I solve this? Should I change another parameter of the get?
Upvotes: 0
Views: 709
Reputation: 12036
The main problem with your script is that you are trying to connect to an https server through an http-only proxy. You need to set a proxy for https as well:

proxies={'https': 'https://x.y.z.a:b'}

In your case you are only setting an http proxy, so https requests are not going through it.

Please note that the proxy servers you gave in your example don't support https.
Upvotes: 2