Reputation: 8096
I have an application running on AWS that makes a request to a page to pull meta tags using requests
. I'm finding that page is allowing curl requests, but not allowing requests from the requests
library.
Works:
curl https://www.seattletimes.com/nation-world/mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/
Hangs Forever:
imports requests
requests.get('https://www.seattletimes.com/nation-world/mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/')
What is the difference between curl and requests here? Should I just spawn a curl process to make my requests?
Upvotes: 1
Views: 277
Reputation: 998
Either of the agents below do indeed work. One can also use the user_agent module (located on pypi here) to generate random and valid web user agents.
import requests
agent = (
"Mozilla/5.0 (X11; Linux x86_64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/85.0.4183.102 Safari/537.36"
)
# or can use
# agent = "curl/7.61.1"
url = ("https://www.seattletimes.com/nation-world/"
"mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/")
r = requests.get(url, headers={'user-agent': agent})
Or, using the user_agent module:
import requests
from user_agent import generate_user_agent
agent = generate_user_agent()
url = ("https://www.seattletimes.com/nation-world/"
"mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/")
r = requests.get(url, headers={'user-agent': agent})
To further explain, requests sets a default user agent here, and the seattle times is blocking this user agent. However, with python-requests one can easily change the header parameters in the request as shown above.
To illustrate the default parameters:
r = requests.get('https://google.com/')
print(r.request.headers)
>>> {'User-Agent': 'python-requests/2.25.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
vs. the updated header parameter
agent = "curl/7.61.1"
r = requests.get('https://google.com/', headers={'user-agent': agent})
print(r.request.headers)
>>>{'user-agent': 'curl/7.61.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
Upvotes: 2