Reputation: 1120
So basically what I mean is, when I search https://www.google.com/search?q=turtles, the first result's href attribute is a google.com/url redirect. Now, I wouldn't mind this if I was just browsing the internet with my browser, but I am trying to get search results in python. So for this code:
import requests
from bs4 import BeautifulSoup
def get_web_search(query):
query = query.replace(' ', '+') # Replace with %20 also works
response = requests.get('https://www.google.com/search', params={"q":
query})
r_data = response.content
soup = BeautifulSoup(r_data, 'html.parser')
result_raw = []
results = []
for result in soup.find_all('h3', class_='r', limit=1):
result_raw.append(result)
for result in result_raw:
results.append({
'url' : result.find('a').get('href'),
'text' : result.find('a').get_text()
})
print(results)
get_web_search("turtles")
I would expect
[{ url : "https://en.wikipedia.org/wiki/Turtle", text : "Turtle - Wikipedia" }]
But what I get instead is
[{'url': '/url?q=https://en.wikipedia.org/wiki/Turtle&sa=U&ved=0ahUKEwja-oaO7u3XAhVMqo8KHYWWCp4QFggVMAA&usg=AOvVaw31hklS09NmMyvgktL1lrTN', 'text': 'Turtle - Wikipedia'}
Is there something I am missing here? Do I need to provide a different header or some other request parameter? Any help is appreciated. Thank you.
NOTE: I saw other posts about this but I am a beginner so I couldn't understand those as they were not in python
Upvotes: 1
Views: 2125
Reputation: 1724
You can use CSS
selectors to grab those links.
soup.select_one('.yuRUbf a')['href']
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"
"Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q=turtles', headers=headers)
soup = BeautifulSoup(html.text, 'html.parser')
# iterates over organic results container
for result in soup.select('.tF2Cxc'):
# extracts url from "result" container
url = result.select_one('.yuRUbf a')['href']
print(url)
------------
'''
https://en.wikipedia.org/wiki/Turtle
https://www.worldwildlife.org/species/sea-turtle
https://www.britannica.com/animal/turtle-reptile
https://www.britannica.com/story/whats-the-difference-between-a-turtle-and-a-tortoise
https://www.fisheries.noaa.gov/sea-turtles
https://www.fisheries.noaa.gov/species/green-turtle
https://turtlesurvival.org/
https://www.outdooralabama.com/reptiles/turtles
https://www.rewild.org/lost-species/lost-turtles
'''
Alternatively, you can do the same thing using Google Search Engine Results API from SerpApi.
It's a paid API with a free trial of 5,000 searches and the main difference here is that all you have to do is to navigate through structured JSON
rather than figuring out why stuff doesn't work.
Code to integrate:
from serpapi import GoogleSearch
params = {
"api_key": "YOUR_API_KEY",
"engine": "google",
"q": "turtle",
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
print(result['link'])
--------------
'''
https://en.wikipedia.org/wiki/Turtle
https://www.britannica.com/animal/turtle-reptile
https://www.britannica.com/story/whats-the-difference-between-a-turtle-and-a-tortoise
https://turtlesurvival.org/
https://www.worldwildlife.org/species/sea-turtle
https://www.conserveturtles.org/
'''
Disclaimer, I work for SerpApi.
Upvotes: 0
Reputation: 22440
You can do the same using selenium in combination with python and BeautifulSoup. It will give you the first result no matter whether the webpage is javascript enable or a general one:
from selenium import webdriver
from bs4 import BeautifulSoup
def get_data(search_input):
search_input = search_input.replace(" ","+")
driver.get("https://www.google.com/search?q=" + search_input)
soup = BeautifulSoup(driver.page_source,'lxml')
for result in soup.select('h3.r'):
item = result.select("a")[0].text
link = result.select("a")[0]['href']
print("item_text: {}\nitem_link: {}".format(item,link))
break
if __name__ == '__main__':
driver = webdriver.Chrome()
try:
get_data("turtles")
finally:
driver.quit()
Output:
item_text: Turtle - Wikipedia
item_link: https://en.wikipedia.org/wiki/Turtle
Upvotes: 0
Reputation: 777
Just follow the link's redirect, and it will goto the right page. Assume your link is in the url
variable.
import urllib2
url = "/url?q=https://en.wikipedia.org/wiki/Turtle&sa=U&ved=0ahUKEwja-oaO7u3XAhVMqo8KHYWWCp4QFggVMAA&usg=AOvVaw31hklS09NmMyvgktL1lrTN"
url = "www.google.com"+url
response = urllib2.urlopen(url) # 'www.google.com/url?q=https://en.wikipedia.org/wiki/Turtle&sa=U&ved=0ahUKEwja-oaO7u3XAhVMqo8KHYWWCp4QFggVMAA&usg=AOvVaw31hklS09NmMyvgktL1lrTN'
response.geturl() # 'https://en.wikipedia.org/wiki/Turtle'
This works, since you are getting back google's redirect to the url which is what you are really clicking everytime you search. This code, basically just follows the redirect until it arrives at the real url.
Upvotes: 1
Reputation: 778
Use this package that provides google search
https://pypi.python.org/pypi/google
Upvotes: 0