uberdr3eam

Reputation: 46

Scrape google search snippet results

I'm trying to write a small program where you input a search query, it opens your browser with the result, and then it scrapes the Google search result and prints it. I don't know how I would go about doing the scraping part. This is all I have so far:

import webbrowser

query = input("What would you like to search: ")
query = query.replace(" ", "+")  # spaces aren't valid in a URL query string
webbrowser.open("https://www.google.com/search?q=" + query)

Let's say they type: "Who is donald trump?" Their browser will open and show the Google search results for that query.

How would I go about scraping the summary provided by Wikipedia and then have it printed back to the user? Or, for that matter, how do I scrape any data from a website?

Upvotes: 2

Views: 6119

Answers (4)

Dmitriy Zub

Reputation: 1734

To scrape just the summary you can use the select_one() method provided by bs4, passing it a CSS selector. You can use the SelectorGadget Chrome extension (or any other) to pick a selector quickly.

Make sure you're using a user-agent, otherwise Google could block your request, because the default user-agent will be python-requests (if you're using the requests library). There are lists of user-agents you can use to fake a real user visit.

From there you can scrape every other part you want using the select_one() method. Keep in mind that you can only scrape info from the knowledge graph if Google provides it, so you can use an if or try-except statement to handle the case where it's missing (see the sketch after the output below).

Code and full example:

from bs4 import BeautifulSoup
import requests
import lxml  # parser backend used by BeautifulSoup below

# a real user-agent keeps Google from rejecting the request as a bot
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=who is donald trump', headers=headers).text

soup = BeautifulSoup(html, 'lxml')

# the knowledge panel summary is the <span> right after the .Uo8X3b element
summary = soup.select_one('.Uo8X3b + span').text
print(summary)

Output:

Donald John Trump is an American media personality and businessman who served as the 45th president of the United States from 2017 to 2021.
Born and raised in Queens, New York City, Trump attended Fordham University and the University of Pennsylvania, graduating with a bachelor's degree in 1968.
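
For example, a minimal sketch of that try-except guard, continuing from the soup object built above (it reuses the same selector, which may change whenever Google updates its markup):

try:
    # select_one() returns None when Google shows no knowledge panel for the query,
    # so .text raises AttributeError in that case
    summary = soup.select_one('.Uo8X3b + span').text
    print(summary)
except AttributeError:
    print('No knowledge graph summary for this query.')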

An alternative way to do it is to use the Google Knowledge Graph API from SerpApi. It's a paid API with a free plan. Check out the playground to see if it suits your needs.

Example code to integrate:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "who is donald trump",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

summary = results["knowledge_graph"]['description']
print(summary)

Output:

Donald John Trump is an American media personality and businessman who served as the 45th president of the United States from 2017 to 2021.
Born and raised in Queens, New York City, Trump attended Fordham University and the University of Pennsylvania, graduating with a bachelor's degree in 1968.

Disclaimer: I work for SerpApi.

Upvotes: 2

Naazneen Jatu

Reputation: 546

I used the Selenium web driver and extracted the Google results snippets successfully.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# specify the path of your chromedriver executable
browser = webdriver.Chrome('path/to/chromedriver')
browser.get('http://google.co.in/')

x = "who is donald trump"  # x is the query
sbar = browser.find_element_by_id('lst-ib')
sbar.send_keys(x)
sbar.send_keys(Keys.ENTER)

# elements on Google's search page have different classes and ids,
# so we have to try several selectors to get an answer
elem = None
try:
    elem = browser.find_element_by_css_selector('div.MUxGbd.t51gnb.lyLwlc.lEBKkf')
except:
    pass
try:
    elem = browser.find_element_by_css_selector('span.ILfuVd.yZ8quc')
except:
    pass
try:
    elem = browser.find_element_by_css_selector('div.Z0LcW')
except:
    pass

if elem is not None:
    print(elem.text)

I hope it helps. If you find errors, please let me know! P.S. Take care of the indentation.

Note: you need to have the driver for the browser you will be using.

Upvotes: 0

Mayuri K

Reputation: 21

The above code works well except for the ID: with id="rhs_block" I don't get any results, so I used id="res" instead. Maybe that was changed recently (see the sketch below).
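
A minimal sketch of that substitution, assuming the requests/BeautifulSoup setup from the answer that uses id="rhs_block" (Google's container ids change over time, so inspect the page if neither works):

import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.google.com/search?q=who+is+donald+trump").text
soup = BeautifulSoup(html, "html.parser")

# "res" is the outer results container; "rhs_block" no longer matched anything for me
for s in soup.find_all(id="res"):
    print(s.text)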

Upvotes: -1

Mangohero1

Reputation: 1912

Although there are quite a few ways you can scrape data, I've demonstrated this using a library called BeautifulSoup. I believe it's a much more flexible option than using webbrowser to scrape data. Don't worry if this seems new to you; I'll walk you through the steps.


You'll need the BeautifulSoup and requests modules. If you don't have them, install them with pip.
Import the modules:

import requests
from bs4 import BeautifulSoup

Get the user input and save it to a variable:

query = input("What would you like to search: ")
query = query.replace(" ","+")
query = "https://www.google.com/search?q=" + query

Use the requests module to send a GET request to the host:

r = requests.get(query)
html_doc = r.text

Instantiate a BeautifulSoup object:

soup = BeautifulSoup(html_doc, 'html.parser')

Finally scrape the desired text:

for s in soup.find_all(id="rhs_block"):
    print(s.text)

Notice the ID. This ID is the container where Google puts all the snippet text, so this loop will literally spit out all the text it finds inside that container, but you can, of course, format it to look a little neater.
By the way, if you happen to run into a UnicodeEncodeError, you'll have to append .encode('utf-8') to the end of each text property (see the sketch below).
Let me know if you have any more questions. Cheers!
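
For instance, a small sketch of that workaround applied to the loop above:

for s in soup.find_all(id="rhs_block"):
    # encoding first avoids UnicodeEncodeError on consoles that can't print some characters
    print(s.text.encode('utf-8'))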

Upvotes: 1
