user5505459

How can I scrape the first link of a Google search with Beautiful Soup?

I'm trying to write a script that scrapes only the first link of a Google search, so that I can run a search from the terminal and look at the link later alongside the search term. I'm struggling to get only the first result. This is the closest I've got so far:

import requests
from bs4 import BeautifulSoup

research_later = "hiya"
goog_search = "https://www.google.co.uk/search?sclient=psy-ab&client=ubuntu&hs=k5b&channel=fs&biw=1366&bih=648&noj=1&q=" + research_later


r = requests.get(goog_search)
soup = BeautifulSoup(r.text, "html.parser")

# This prints every link on the page, not just the first result
for link in soup.find_all('a'):
    print(research_later + " : " + str(link.get('href')))

Upvotes: 12

Views: 12028

Answers (2)

Dmitriy Zub

Reputation: 1724

You can use either select_one(), which takes a CSS selector, or find() to get only a single element from the page. To find CSS selectors, you can use the SelectorGadget browser extension.

Example code:

from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=ice cream', headers=headers).text
soup = BeautifulSoup(html, 'lxml')

# locate the first <a> tag inside an element with the .yuRUbf class
# (select_one() returns a single tag; select() would return a list),
# then read its 'href' attribute
link = soup.select_one('.yuRUbf a')['href']
print(link)

# https://en.wikipedia.org/wiki/Ice_cream
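
The difference between the lookup methods matters here: select() returns a list of every match, while select_one() and find() return the first match, or None when nothing matches. A minimal offline sketch of that behaviour follows. The HTML fragment is made up for illustration, the helper name first_result_link is hypothetical, and the yuRUbf class is simply the one used in this answer (Google's real markup differs and its class names can change at any time).

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a results page; real Google HTML
# differs and its class names change without notice.
SAMPLE_HTML = """
<div class="yuRUbf"><a href="https://example.com/first">First</a></div>
<div class="yuRUbf"><a href="https://example.com/second">Second</a></div>
"""


def first_result_link(html):
    """Return the href of the first '.yuRUbf a' match, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one(".yuRUbf a")  # a single Tag, or None
    return tag.get("href") if tag is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() returns a list of all matching tags
all_links = [a["href"] for a in soup.select(".yuRUbf a")]

# find() is the non-CSS equivalent for grabbing only the first match
first_via_find = soup.find("div", class_="yuRUbf").find("a")["href"]

print(first_result_link(SAMPLE_HTML))
print(first_result_link("<p>no results</p>"))
```

Checking for None before indexing avoids a TypeError when the selector stops matching, which is likely to happen whenever the page layout changes.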

Alternatively, you can do the same thing by using Google Search Engine Results API from SerpApi. It's a paid API with a free plan.

The main difference is that element selection, block bypassing, proxy rotation, and more are already handled for the end user, and the results come back as JSON.

Code to integrate:

import os
from serpapi import GoogleSearch  # pip install google-search-results

params = {
    "engine": "google",
    "q": "ice cream",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

# [0] - first index from the search results
link = results['organic_results'][0]['link']
print(link)

# https://en.wikipedia.org/wiki/Ice_cream
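
Indexing results['organic_results'][0]['link'] directly raises a KeyError or IndexError when the response contains no organic results. A small defensive sketch, assuming only the dictionary shape shown in this answer; the helper name first_organic_link and the sample dictionaries are hand-written for illustration and are not real API output.

```python
def first_organic_link(results):
    """Return the first organic result's link, or None when there is none."""
    organic = results.get("organic_results") or []
    if not organic:
        return None
    return organic[0].get("link")


# Hand-written stand-ins for a parsed JSON response (assumed shape)
sample = {"organic_results": [{"link": "https://en.wikipedia.org/wiki/Ice_cream"}]}
empty = {}

print(first_organic_link(sample))
print(first_organic_link(empty))
```

Returning None instead of raising lets a terminal script print a "no result" message and move on.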

Disclaimer: I work for SerpApi.

Upvotes: 1

Remi Guan

Reputation: 22292

It seems that Google uses the cite tag to hold the link, so we can just use soup.find('cite').text, like this:

import requests
from bs4 import BeautifulSoup

research_later = "hiya"
goog_search = "https://www.google.co.uk/search?sclient=psy-ab&client=ubuntu&hs=k5b&channel=fs&biw=1366&bih=648&noj=1&q=" + research_later


r = requests.get(goog_search)

soup = BeautifulSoup(r.text, "html.parser")
print(soup.find('cite').text)

Output is:

www.urbandictionary.com/define.php?term=hiya
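
One caveat: the search term is appended to the URL as-is, which works for a single word like "hiya" but is fragile for terms containing spaces, "&" or "=". A minimal sketch of building the query string with the standard library instead; the helper name build_search_url is hypothetical, and the extra query parameters from the URL above are omitted on the assumption that only q is needed.

```python
from urllib.parse import urlencode


def build_search_url(term, base="https://www.google.co.uk/search"):
    """Return a search URL with the term safely percent-encoded."""
    return base + "?" + urlencode({"q": term})


print(build_search_url("ice cream"))
# https://www.google.co.uk/search?q=ice+cream
```

The resulting string can be passed to requests.get() in place of goog_search.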

Upvotes: 10
