Reputation: 115
I am trying to get the first non-ad result for a simple query on Google:
res = requests.get('https://www.google.com?q=' + query)
Assign any value to query and you will get an error. I have tried adding some headers, but nothing changes. I have also tried adding all the other parameters that Google typically associates with a query, and again nothing changes.
There are no problems if you do the search with Selenium.
The status code is 429, but this seems to be just the standard response to this request: it has nothing to do with my IP, I am not spamming Google, and it does not disappear after a while.
Do you know why this happens? Is there some header I can add, or any other solution, to just see the results as if I were searching that keyword on Google?
Upvotes: 9
Views: 40217
Reputation: 1724
This is one of the most common questions on Stack Overflow, asked 200+ times in the [requests] and [bs4] tags, and pretty much every solution comes down to simply adding a user-agent.
A user-agent is needed to make the request look like a "real" user visit: a bot sends a fake user-agent string to announce itself as a different client.
When no user-agent is passed in the request headers while using the requests library, it defaults to python-requests. Google understands that it's a bot/script, blocks the request (or whatever it does), and you receive different HTML (with some sort of error) and different CSS selectors. Check what your user-agent is. List of user-agents.
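To see what is sent by default, you can print the requests library's own default user-agent string with requests.utils.default_user_agent():

import requests

# Prints something like 'python-requests/2.31.0' -- the string Google
# recognizes as a script rather than a browser.
print(requests.utils.default_user_agent())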
Note: adding a user-agent doesn't mean the problem is fixed; you can still get a 429 (or a different) error, even when rotating user-agents.
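As an illustration, a minimal sketch of rotating user-agents; the pool of strings below is made up for the example, not a recommended set:

import random
import requests

# A hand-picked pool of browser user-agent strings (illustrative values only).
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
]

# Pick a different user-agent per request so traffic looks less like one script.
headers = {'User-agent': random.choice(user_agents)}
response = requests.get('https://www.google.com/search', params={'q': 'selenium'}, headers=headers)
print(response.status_code)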
I wrote a dedicated blog post about how to reduce the chance of being blocked while web scraping search engines. In short, you need to pass a user-agent:
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

requests.get('URL', headers=headers)
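Once a 200 comes back, pulling out the first organic result is a separate parsing step. A rough sketch with BeautifulSoup follows; the 'div.g' and 'h3' selectors are an assumption based on Google's past markup and will break whenever Google changes its HTML:

import requests
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'}
html = requests.get('https://www.google.com/search', params={'q': 'selenium'}, headers=headers).text
soup = BeautifulSoup(html, 'html.parser')

# 'div.g' has historically wrapped organic (non-ad) results; treat this
# selector as a guess that needs re-checking against the live page.
first = soup.select_one('div.g')
if first:
    title = first.select_one('h3')
    link = first.select_one('a')
    print(title.get_text() if title else None)
    print(link['href'] if link else None)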
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to spend time trying to bypass blocks from Google and figuring out why certain things don't work.
Disclaimer: I work for SerpApi.
Upvotes: 1
Reputation: 193218
The HTTP 429 Too Many Requests response status code indicates that the user has sent too many requests in a given amount of time ("rate limiting"). The response representations SHOULD include details explaining the condition, and MAY include a Retry-After header indicating how long to wait before making a new request.
When a server is under attack or just receiving a very large number of requests from a single party, responding to each with a 429 status code will consume resources. Therefore, servers are not required to use the 429 status code; when limiting resource usage, it may be more appropriate to just drop connections, or take other steps.
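When a Retry-After header is present on a 429 response, you can honor it directly. A minimal sketch, assuming the header carries a delay in seconds (it can also be an HTTP date, which this sketch does not handle):

import time
import requests

res = requests.get('https://www.google.com/search?q=selenium')
if res.status_code == 429:
    # Fall back to 30 seconds if the header is missing.
    delay = int(res.headers.get('Retry-After', 30))
    time.sleep(delay)
    res = requests.get('https://www.google.com/search?q=selenium')
print(res.status_code)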
However, when I took your code, added a browser User-Agent and the /search path, and executed the same test, I got a perfect result as follows:
Code Block:
import requests

query = "selenium"
# A regular browser User-Agent, so Google doesn't treat the request as a script.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
# Note the /search path, which is Google's actual search endpoint.
url = 'https://www.google.com/search?q=' + query
res = requests.get(url, headers=headers)
print(res)
Console Output:
<Response [200]>
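A small variant of the same request: passing the query through the params argument lets requests URL-encode it, which matters once query contains spaces or special characters:

import requests

query = "selenium web driver"  # spaces get encoded automatically
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
res = requests.get('https://www.google.com/search', params={'q': query}, headers=headers)
print(res.status_code, res.url)  # res.url shows the final encoded URL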
You can find a relevant discussion in Failed to load resource: the server responded with a status of 429 (Too Many Requests) and 404 (Not Found) with ChromeDriver Chrome through Selenium
Upvotes: 5
Reputation: 41
I found the reason why a simple Google query through a plain REST request returns a 429 error.
The user-agent header is one reason, but I tried inserting a user-agent header into the request and the response was still a 429 error.
Finally I found the cause: cookies.
If you want to access Google pages, you first have to get cookies from a basic Google URL such as google.com, trends.google.com, or youtube.com. These basic sites can be accessed with any request method.
import requests

googleTrendsUrl = 'https://google.com'
response = requests.get(googleTrendsUrl)
if response.status_code == 200:
    g_cookies = response.cookies.get_dict()
Then insert these cookies into the search request, along with a user-agent:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}
url = 'https://www.google.com/search?q=' + query
res = requests.get(url, headers=headers, cookies=g_cookies)
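The same cookie handling can be written with requests.Session(), which stores cookies from the first response and sends them on later requests automatically; a minimal sketch:

import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'})

# The first request collects Google's cookies into the session.
session.get('https://google.com')

# Those cookies are sent automatically on this request.
res = session.get('https://www.google.com/search', params={'q': 'selenium'})
print(res.status_code)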
Upvotes: 4
Reputation: 2689
You are getting status code 429, which means you have sent too many requests in a given amount of time ("rate limiting"). Read about it in more detail here.
Add headers to your request like this:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}
So the final request will be:
url = 'https://www.google.com/search?q=' + query
res = requests.get(url, headers=headers)
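Since 429 means rate limiting, it can also help to retry with an increasing delay rather than immediately; a minimal exponential-backoff sketch:

import time
import requests

query = 'selenium'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}
url = 'https://www.google.com/search?q=' + query

for attempt in range(4):
    res = requests.get(url, headers=headers)
    if res.status_code != 429:
        break
    # Back off 1s, 2s, 4s, 8s between retries.
    time.sleep(2 ** attempt)
print(res.status_code)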
Upvotes: 5