ICanKindOfCode

Reputation: 1120

Google search gives redirect URL, not real URL (Python)

When I request https://www.google.com/search?q=turtles, the first result's href attribute is a google.com/url redirect rather than the real URL. I wouldn't mind this if I were just browsing with my browser, but I am trying to get search results in Python. So for this code:

import requests
from bs4 import BeautifulSoup

def get_web_search(query):
    # requests URL-encodes params itself; replacing spaces with '+' first
    # would send a literal plus sign (%2B) instead of a space
    response = requests.get('https://www.google.com/search', params={"q": query})
    r_data = response.content
    soup = BeautifulSoup(r_data, 'html.parser')
    result_raw = []
    results = []
    for result in soup.find_all('h3', class_='r', limit=1):
        result_raw.append(result) 

    for result in result_raw:
        results.append({
            'url' : result.find('a').get('href'),
            'text' : result.find('a').get_text()
        })

    print(results)

get_web_search("turtles")

I would expect

[{'url': 'https://en.wikipedia.org/wiki/Turtle', 'text': 'Turtle - Wikipedia'}]

But what I get instead is

[{'url': '/url?q=https://en.wikipedia.org/wiki/Turtle&sa=U&ved=0ahUKEwja-oaO7u3XAhVMqo8KHYWWCp4QFggVMAA&usg=AOvVaw31hklS09NmMyvgktL1lrTN', 'text': 'Turtle - Wikipedia'}]

Is there something I am missing here? Do I need to provide a different header or some other request parameter? Any help is appreciated. Thank you.

NOTE: I saw other posts about this, but I am a beginner and couldn't understand them because they were not in Python.

Upvotes: 1

Views: 2125

Answers (4)

Dmitriy Zub

Reputation: 1724

You can use CSS selectors to grab those links.

soup.select_one('.yuRUbf a')['href']

Full code:

from bs4 import BeautifulSoup
import requests

headers = {
    # adjacent string literals are concatenated, so the first one must end with a space
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=turtles', headers=headers)
soup = BeautifulSoup(html.text, 'html.parser')
# iterates over organic results container
for result in soup.select('.tF2Cxc'):
    # extracts url from "result" container 
    url = result.select_one('.yuRUbf a')['href']
    print(url)

------------
'''
https://en.wikipedia.org/wiki/Turtle
https://www.worldwildlife.org/species/sea-turtle
https://www.britannica.com/animal/turtle-reptile
https://www.britannica.com/story/whats-the-difference-between-a-turtle-and-a-tortoise
https://www.fisheries.noaa.gov/sea-turtles
https://www.fisheries.noaa.gov/species/green-turtle
https://turtlesurvival.org/
https://www.outdooralabama.com/reptiles/turtles
https://www.rewild.org/lost-species/lost-turtles
'''

Alternatively, you can do the same thing using the Google Search Engine Results API from SerpApi.

It's a paid API with a free trial of 5,000 searches. The main difference is that you only have to navigate structured JSON instead of working out which selectors to use and why they stopped matching.

Code to integrate:

from serpapi import GoogleSearch

params = {
  "api_key": "YOUR_API_KEY",
  "engine": "google",
  "q": "turtle",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    print(result['link'])

--------------
'''
https://en.wikipedia.org/wiki/Turtle
https://www.britannica.com/animal/turtle-reptile
https://www.britannica.com/story/whats-the-difference-between-a-turtle-and-a-tortoise
https://turtlesurvival.org/
https://www.worldwildlife.org/species/sea-turtle
https://www.conserveturtles.org/
'''

Disclaimer: I work for SerpApi.

Upvotes: 0

SIM

Reputation: 22440

You can do the same using Selenium in combination with Python and BeautifulSoup. It will give you the first result whether the page is rendered with JavaScript or served as static HTML:

from selenium import webdriver
from bs4 import BeautifulSoup

def get_data(search_input):
    search_input = search_input.replace(" ","+")
    driver.get("https://www.google.com/search?q=" + search_input)
    soup = BeautifulSoup(driver.page_source,'lxml')
    for result in soup.select('h3.r'):
        item = result.select("a")[0].text
        link = result.select("a")[0]['href']
        print("item_text: {}\nitem_link: {}".format(item,link))
        break

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        get_data("turtles")
    finally:
        driver.quit()

Output:

item_text: Turtle - Wikipedia
item_link: https://en.wikipedia.org/wiki/Turtle

Upvotes: 0

Noah Cristino

Reputation: 777

Just follow the link's redirect, and it will go to the right page. Assume your link is in the url variable.

from urllib.request import urlopen  # Python 3; in Python 2 this was urllib2.urlopen

url = "/url?q=https://en.wikipedia.org/wiki/Turtle&sa=U&ved=0ahUKEwja-oaO7u3XAhVMqo8KHYWWCp4QFggVMAA&usg=AOvVaw31hklS09NmMyvgktL1lrTN"
url = "https://www.google.com" + url  # the scheme is required, otherwise urlopen raises ValueError
response = urlopen(url)  # requests 'https://www.google.com/url?q=...' and follows the redirect
response.geturl()  # 'https://en.wikipedia.org/wiki/Turtle'

This works because the href is Google's redirect to the target URL, which is what you actually click every time you search. The code simply follows the redirect until it arrives at the real URL.
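
If you would rather not make a second request, the target URL is also sitting in the q parameter of the redirect href, so it can be read directly. The following is only a minimal sketch, assuming the href keeps the /url?q=<target>&... shape shown in the question; extract_target is a hypothetical helper name, not part of any library.

```python
from urllib.parse import urlparse, parse_qs

def extract_target(href):
    """Return the real URL from a Google '/url?q=...' href (hypothetical helper).

    Assumption: the target is carried in the 'q' query parameter, as in the
    href from the question. Any other href is returned unchanged.
    """
    parsed = urlparse(href)
    if parsed.path == "/url":
        # parse_qs also percent-decodes the value
        targets = parse_qs(parsed.query).get("q")
        if targets:
            return targets[0]
    return href

href = "/url?q=https://en.wikipedia.org/wiki/Turtle&sa=U&ved=0ahUKEwja-oaO7u3XAhVMqo8KHYWWCp4QFggVMAA&usg=AOvVaw31hklS09NmMyvgktL1lrTN"
print(extract_target(href))  # https://en.wikipedia.org/wiki/Turtle
```

This does no network I/O, so it is faster than following the redirect, but it depends on Google keeping that href format.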

Upvotes: 1

Tilak Putta

Reputation: 778

Use this package, which provides Google search:

https://pypi.python.org/pypi/google

Upvotes: 0
