Morten

Reputation: 95

urlopen via urllib.request with valid User-Agent returns 405 error

My question is about the urllib module in Python 3. The following piece of code

import urllib.request
import urllib.parse

url = "https://google.com/search?q=stackoverflow"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}

try:
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req)
    file = open('googlesearch.txt.', 'w')
    file.write(str(resp.read()))
    file.close()

except Exception as e:
    print(str(e))

works as I expect and writes the content of the Google search for 'stackoverflow' to a file. We need to set a valid User-Agent, otherwise Google does not allow the request and returns a 405 Invalid Method error.

I think the following piece of code

import urllib.request
import urllib.parse

url = "https://google.com/search"
values = {'q': 'stackoverflow'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}

data = urllib.parse.urlencode(values)
data = data.encode('utf-8')

try:
    req = urllib.request.Request(url, data=data, headers=headers)
    resp = urllib.request.urlopen(req)
    file = open('googlesearch.txt.', 'w')
    file.write(str(resp.read()))
    file.close()

except Exception as e:
    print(str(e))

should produce the same output as the first one, as it is the same Google search with the same User-Agent. However, this piece of code throws an exception with the message: 'HTTP Error 405: Method Not Allowed'.

My question is: what is wrong with the second piece of code? Why does it not produce the same output as the first one?

Upvotes: 1

Views: 836

Answers (2)

Håken Lid

Reputation: 23064

You get the 405 response because you are sending a POST request instead of a GET request. "Method Not Allowed" has nothing to do with your User-Agent header. It means the HTTP request used a method (GET, POST, PUT, HEAD, OPTIONS, PATCH, DELETE) that the server does not accept for that URL.

urllib sends a POST because you include the data argument in the Request constructor, as documented here:

https://docs.python.org/3/library/urllib.request.html#urllib.request.Request

method should be a string that indicates the HTTP request method that will be used (e.g. 'HEAD'). If provided, its value is stored in the method attribute and is used by get_method(). The default is 'GET' if data is None or 'POST' otherwise.
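
To see that default in action without touching the network, here is a minimal sketch that only constructs Request objects and asks each one which method it would use. The variable names and the payload are illustrative and not taken from the question.

```python
import urllib.request

url = "https://google.com/search"
payload = b"q=stackoverflow"  # illustrative form-encoded body

# No data argument: the request defaults to GET.
get_req = urllib.request.Request(url)

# A data argument is present: the request defaults to POST.
post_req = urllib.request.Request(url, data=payload)

# An explicit method overrides the default, but the body is still attached.
forced_get = urllib.request.Request(url, data=payload, method="GET")

print(get_req.get_method())     # GET
print(post_req.get_method())    # POST
print(forced_get.get_method())  # GET
```

Forcing method="GET" while still passing data is probably not what you want here, since the search term would travel in the request body rather than in the query string.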

I highly recommend using the requests library instead of urllib, because it has a much more sensible API.

import requests
# params is encoded into the query string, so this stays a GET request
response = requests.get('https://google.com/search', params={'q': 'stackoverflow'})
response.raise_for_status()  # raise exception if status code is 4xx or 5xx
with open('googlesearch.txt', 'w') as fp:
    fp.write(response.text)

https://github.com/requests/requests

Upvotes: 3

Dharmesh Fumakiya

Reputation: 2338

https://docs.python.org/3.4/howto/urllib2.html#data

If you do not pass the data argument, urllib uses a GET request. One way in which GET and POST requests differ is that POST requests often have “side-effects”: they change the state of the system in some way (for example by placing an order with the website for a hundredweight of tinned spam to be delivered to your door).
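
If you want to stay with urllib, one way to keep the request a GET is to encode the parameters into the URL instead of passing them as data. The following is a rough sketch; build_search_request is a hypothetical helper name, not part of urllib.

```python
import urllib.parse
import urllib.request


def build_search_request(base_url, params, user_agent):
    """Hypothetical helper: build a GET request with params in the query string."""
    query = urllib.parse.urlencode(params)
    full_url = f"{base_url}?{query}"
    # No data argument is passed, so urllib will send this as a GET.
    return urllib.request.Request(full_url, headers={"User-Agent": user_agent})


req = build_search_request(
    "https://google.com/search",
    {"q": "stackoverflow"},
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64)",
)

print(req.full_url)      # https://google.com/search?q=stackoverflow
print(req.get_method())  # GET
```

Passing req to urllib.request.urlopen should then behave like the first snippet in the question, because the resulting URL and headers are the same.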

Upvotes: 3
