Reputation: 731
I have two URLs to fetch data from. Using my code, the first URL works, whereas the second URL gives a ProxyError.
I am using the requests library in Python 3 and have tried searching for the problem on Google and here, but with no success.
My code snippet is:
import requests
proxies = {
    'http': 'http://user:[email protected]:xxxx',
    'https': 'http://user:[email protected]:xxxx',
}
url1 = 'https://en.oxforddictionaries.com/definition/act'
url2 = 'https://dictionary.cambridge.org/dictionary/english/act'
r1 = requests.get(url1, proxies=proxies)
r2 = requests.get(url2, proxies=proxies)
url1 works fine, but url2 gives the following error:
ProxyError: HTTPSConnectionPool(host='dictionary.cambridge.org', port=443): Max retries exceeded with url: /dictionary/english/act (Caused by ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response',)))
The same happens when using requests.post().
Please explain to me why this is happening: is there any difference between the handshaking of the two URLs?
urllib.request.urlopen is working fine, so I am explicitly looking for answers using requests.
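For reference, the urllib call that succeeds looks roughly like this (same placeholder proxy as above):

import urllib.request

# Placeholder proxy credentials and address, as in the snippet above.
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://user:[email protected]:xxxx',
    'https': 'http://user:[email protected]:xxxx',
})
opener = urllib.request.build_opener(proxy_handler)
r = opener.open('https://dictionary.cambridge.org/dictionary/english/act')
print(r.status)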
Upvotes: 8
Views: 38419
Reputation: 11
Rotate proxies from a local proxy pool until one connects, then reuse the cookie the server returns for the actual search request:

import time

import requests


def get_random_proxy():
    """Get a random proxy from a locally running proxy pool."""
    proxypool_url = 'http://127.0.0.1:5555/random'
    return requests.get(proxypool_url).text.strip()


headers = {
    'User-Agent': 'Chrome',
    'Referer': 'https://www.nmpa.gov.cn/datasearch/home-index.html?79QlcAyHig6m=1636513393895',
    'Host': 'nmpa.gov.cn',
    'Origin': 'https://nmpa.gov.cn',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Connection': 'close'
}

url = 'https://www.nmpa.gov.cn/datasearch/search-result.html'


def start_requests(cookie):
    # Reuse the cookie issued by the server on the first successful connection.
    headers['Cookie'] = cookie
    s = requests.get(url=url, headers=headers, stream=True, timeout=(5, 5), verify=False)
    s.encoding = 'utf8'
    print(s.status_code)
    print(s)


while True:
    # Fetch one proxy and use it for both schemes.
    proxy_addr = get_random_proxy()
    proxy = {'http': 'http://' + proxy_addr, 'https': 'http://' + proxy_addr}
    print(proxy)
    try:
        sess = requests.Session()
        sess.keep_alive = False  # close idle connections instead of pooling them
        res = sess.get(url='https://nmpa.gov.cn', headers={'User-Agent': 'Chrome'},
                       proxies=proxy, timeout=10, verify=False)
        res.close()
        print(res.status_code)
        res.encoding = 'utf8'
        cookie = res.headers['Set-Cookie']
        print(cookie)
        if res.status_code == 200:
            print(res.status_code)
            time.sleep(10)
            start_requests(cookie)
            break
    except Exception as error:
        time.sleep(10)
        print("Connection failed:", error)
Upvotes: 0
Reputation: 341
I was able to elicit a valid response for url2 when using the headers keyword argument with the User-Agent string set to Chrome.
r2 = requests.get(url2, proxies=proxies, headers={'User-Agent': 'Chrome'})
To answer your first question, a possible reason for this is server-side configuration: the server might be set up not to accept requests originating from unknown agents or requests with a missing User-Agent header.
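If you want to confirm that the User-Agent header is the deciding factor, a small comparison like the following (without the proxy, or with your proxies dict passed in the same way) should show the difference:

import requests

url2 = 'https://dictionary.cambridge.org/dictionary/english/act'

# Send the same request with and without a browser-like User-Agent
# and compare the outcomes.
for headers in [{}, {'User-Agent': 'Chrome'}]:
    try:
        r = requests.get(url2, headers=headers, timeout=10)
        print(headers, '->', r.status_code)
    except requests.exceptions.RequestException as e:
        print(headers, '->', type(e).__name__)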
Upvotes: 6