micah

Reputation: 8096

Requests vs Curl

I have an application running on AWS that uses requests to fetch a page and pull its meta tags. The page responds to curl, but the same request made with the requests library never returns.

Works:

curl https://www.seattletimes.com/nation-world/mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/

Hangs Forever:

import requests
requests.get('https://www.seattletimes.com/nation-world/mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/')

What is the difference between curl and requests here? Should I just spawn a curl process to make my requests?

Upvotes: 1

Views: 277

Answers (1)

bbd108

Reputation: 998

Either of the user agents below works. You can also use the user_agent module (available on PyPI) to generate random, valid browser user agents.

import requests

agent = (
    "Mozilla/5.0 (X11; Linux x86_64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/85.0.4183.102 Safari/537.36"
)

# Alternatively, a curl-style user agent also works:
# agent = "curl/7.61.1"

url = ("https://www.seattletimes.com/nation-world/"
       "mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/")

r = requests.get(url, headers={'user-agent': agent})

Or, using the user_agent module:

import requests
from user_agent import generate_user_agent

agent = generate_user_agent()

url = ("https://www.seattletimes.com/nation-world/"
       "mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/")

r = requests.get(url, headers={'user-agent': agent})

To explain further: requests sends a default User-Agent header (python-requests/<version>), and the Seattle Times is blocking that user agent. With requests, you can easily override the headers on any request, as shown above.

To illustrate the default parameters:

r = requests.get('https://google.com/')
print(r.request.headers)
# {'User-Agent': 'python-requests/2.25.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

vs. the updated header parameter:

agent = "curl/7.61.1"
r = requests.get('https://google.com/', headers={'user-agent': agent})
print(r.request.headers)
# {'user-agent': 'curl/7.61.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
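
Since the original call hung rather than failing, it may also be worth setting a timeout: requests does not apply one unless you pass it. Below is a minimal sketch, assuming a shared requests.Session that carries the custom User-Agent. The helper names make_session and fetch_html and the 10-second timeout are illustrative choices, not something taken from the question.

```python
import requests

URL = ("https://www.seattletimes.com/nation-world/"
       "mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/")


def make_session(user_agent):
    """Return a Session that sends user_agent on every request (hypothetical helper)."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(session, url, timeout=10):
    """Fetch url and return its body, raising on a timeout or an HTTP error status."""
    # timeout is in seconds (an assumed value); without it, requests can wait indefinitely.
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text


session = make_session("curl/7.61.1")

# Inspect the headers that would be sent, without touching the network.
prepared = session.prepare_request(requests.Request("GET", URL))
default_agent = requests.utils.default_user_agent()
print(default_agent)                   # the agent requests sends by default
print(prepared.headers["User-Agent"])  # the overriding agent set on the session
```

Calling fetch_html(session, URL) performs the actual request. If the server stays silent for longer than the timeout, requests raises requests.exceptions.Timeout instead of hanging.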

Upvotes: 2
