Reputation: 1043
I'm trying to extract a link from a Google search result. Inspecting the element tells me that the section I'm interested in has class="r". The first result looks like this:
<h3 class="r" original_target="https://en.wikipedia.org/wiki/chocolate" style="display: inline-block;">
<a href="https://en.wikipedia.org/wiki/Chocolate"
ping="/url?sa=t&source=web&rct=j&url=https://en.wikipedia.org/wiki/Chocolate&ved=0ahUKEwjW6tTC8LXZAhXDjpQKHSXSClIQFgheMAM"
saprocessedanchor="true">
Chocolate - Wikipedia
</a>
</h3>
To extract the "href" I do:
import bs4, requests
res = requests.get('https://www.google.com/search?q=chocolate')
googleSoup = bs4.BeautifulSoup(res.text, "html.parser")
elements = googleSoup.select(".r a")
elements[0].get("href")
But I unexpectedly get:
'/url?q=https://en.wikipedia.org/wiki/Chocolate&sa=U&ved=0ahUKEwjHjrmc_7XZAhUME5QKHSOCAW8QFggWMAA&usg=AOvVaw03f1l4EU9fYd'
Where I wanted:
"https://en.wikipedia.org/wiki/Chocolate"
The attribute "ping" seems to be confusing it. Any ideas?
Upvotes: 5
Views: 12849
Reputation: 1724
As the other answer mentioned, it's because no user-agent was specified. The default requests user-agent is python-requests, so Google blocks the request because it knows it's a bot and not a "real" user visit. Passing a user-agent in the HTTP request headers fakes a real user visit; it can be done via custom headers (you can look up your own user-agent string to use):
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get("YOUR_URL", headers=headers)
Additionally, to get more accurate results you can pass URL parameters:
params = {
"q": "samurai cop, what does katana mean", # query
"gl": "in", # country to search from
"hl": "en" # language
# other parameters
}
requests.get("YOUR_URL", params=params)
Code and full example in the online IDE (the code from the other answer will throw an error because Google has since changed its CSS selectors):
from bs4 import BeautifulSoup
import requests  # note: lxml must also be installed for the 'lxml' parser used below
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "samurai cop what does katana mean",
"gl": "in",
"hl": "en"
}
html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):  # one '.tF2Cxc' container per organic result
    title = result.select_one('.DKV0Md').text       # result title
    link = result.select_one('.yuRUbf a')['href']   # direct href, no '/url?' wrapper
    print(f'{title}\n{link}\n')
-------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
Samurai Cop (1991) - Mathew Karedas as Joe Marshall - IMDb
https://www.imdb.com/title/tt0130236/characters/nm0360481
...
'''
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to iterate over structured JSON and grab the data you want, rather than figuring out why certain things don't work as they should and then maintaining the parser over time.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "samurai cop what does katana mean",
"hl": "en",
"gl": "in",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
print(result['title'])
print(result['link'])
print()
------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
...
'''
Disclaimer, I work for SerpApi.
Upvotes: 0
Reputation: 7248
If you print the response content (i.e. res.text) you'll see that you're getting completely different HTML; the page source and the response content don't match.
This isn't because the content is loaded dynamically: even in that case, the page source and the response content would be the same (it's only the HTML you see while inspecting an element that would differ).
A basic explanation is that Google recognizes the Python script and changes its response.
You can pass a fake User-Agent to make the script look like a real browser request.
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
r = requests.get('https://www.google.co.in/search?q=chocolate', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
elements = soup.select('.r a')
print(elements[0]['href'])
Output:
https://en.wikipedia.org/wiki/Chocolate
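To see the difference for yourself, you can compare how many matches the selector gets with and without the header (a quick sketch; '.r' was the selector that worked when this was written):
import requests
from bs4 import BeautifulSoup

url = 'https://www.google.co.in/search?q=chocolate'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

plain = BeautifulSoup(requests.get(url).text, 'lxml')
faked = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')

# the stripped-down page served to the bare script has different markup,
# so the same selector finds different (often zero) results
print(len(plain.select('.r a')), len(faked.select('.r a')))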
Upvotes: 12