Reputation: 46
I'm trying to write a small program where you input a search query, it opens your browser with the result, and then scrapes the Google search result and prints it. I don't know how I would go about doing the scraping part. This is all I have so far:
import webbrowser

query = input("What would you like to search: ")
for word in query:
    query = query + "+"
webbrowser.open("https://www.google.com/search?q=" + query)
Let's say they type: "Who is Donald Trump?" Their browser will open and show the Google search results for that query.
How would I go about scraping the summary provided by Wikipedia and then having it printed back to the user? Or, more generally, how do I scrape any data from a website?
Upvotes: 2
Views: 6119
Reputation: 1734
To scrape just the summary you can use the select_one() method provided by bs4, passing it a CSS selector. You can use the SelectorGadget Chrome extension (or any similar tool) to pick a selector quickly.
Make sure you're using a user-agent, otherwise Google could block your request, because the default user-agent of the requests library is python-requests.
List of user-agents to fake a user visit.
From there you can scrape any other part you want using the select_one() method. Keep in mind that you can only scrape info from the Knowledge Graph if Google provides it, so you can use an if or try-except statement to handle the case where it's missing (see the sketch after the output below).
Code and full example:
from bs4 import BeautifulSoup
import requests
import lxml  # parser used by BeautifulSoup below

# A real browser user-agent so Google doesn't block the request
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=who is donald trump', headers=headers).text
soup = BeautifulSoup(html, 'lxml')

# Select the knowledge-panel summary by its CSS selector
summary = soup.select_one('.Uo8X3b+ span').text
print(summary)
Output:
Donald John Trump is an American media personality and businessman who served as the 45th president of the United States from 2017 to 2021.
Born and raised in Queens, New York City, Trump attended Fordham University and the University of Pennsylvania, graduating with a bachelor's degree in 1968.
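If Google doesn't show a knowledge panel for a query, select_one() returns None and .text raises an AttributeError. A minimal sketch of guarding against that, reusing the soup object and the same selector from above (which Google may change at any time):

# select_one() returns None when the selector matches nothing
element = soup.select_one('.Uo8X3b+ span')
if element is not None:
    print(element.text)
else:
    print("No summary found for this query.")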
An alternative way to do it is using the Google Knowledge Graph API from SerpApi. It's a paid API with a free plan. Check out the playground to see if it suits your needs.
Example code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "who is donald trump",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()
summary = results["knowledge_graph"]["description"]
print(summary)
Output:
Donald John Trump is an American media personality and businessman who served as the 45th president of the United States from 2017 to 2021.
Born and raised in Queens, New York City, Trump attended Fordham University and the University of Pennsylvania, graduating with a bachelor's degree in 1968.
Disclaimer: I work for SerpApi.
Upvotes: 2
Reputation: 546
I have used the Selenium web driver and successfully extracted the Google results snippets.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Specify the path of the chromedriver executable
browser = webdriver.Chrome(r'path\chromedriver')
browser.get('http://google.co.in/')

# Type the query into the search box and press Enter
sbar = browser.find_element_by_id('lst-ib')
sbar.send_keys(x)  # x is the query
sbar.send_keys(Keys.ENTER)

# Elements on the Google results page have different classes and ids,
# so we have to try several selectors to get an answer.
try:
    elem = browser.find_element_by_css_selector('div.MUxGbd.t51gnb.lyLwlc.lEBKkf')
except:
    pass
try:
    elem = browser.find_element_by_css_selector('span.ILfuVd.yZ8quc')
except:
    pass
try:
    elem = browser.find_element_by_css_selector('div.Z0LcW')
except:
    pass

print(elem.text)
I hope it helps. If you find errors, please let me know! P.S.: Take care of the indentation.
Note: you should have the driver for the browser you will be using.
Upvotes: 0
Reputation: 21
The above code works well except for the ID: with id="rhs_block" I don't get any results. Instead I used id="res". Maybe that was updated recently.
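In case it helps, a minimal sketch of that change applied to the find_all() call from the BeautifulSoup answer (same soup object as there; "res" is the broader results container, so the output is more verbose):

# Use the "res" results container instead of "rhs_block"
for s in soup.find_all(id="res"):
    print(s.text)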
Upvotes: -1
Reputation: 1912
Although there are really quite a few ways you can scrape data, I've demonstrated this using a library called BeautifulSoup. I believe it's a much more flexible option than using webbrowser to scrape data. Don't worry if this seems new to you, I'll walk you through the steps.
First, import the BeautifulSoup and requests modules. If you don't have them, install them with pip.
import requests
from bs4 import BeautifulSoup
Get the user input and save it to a variable:
query = input("What would you like to search: ")
query = query.replace(" ","+")
query = "https://www.google.com/search?q=" + query
Use the requests module to send a GET request to the host:
r = requests.get(query)
html_doc = r.text
Instantiate a BeautifulSoup object:
soup = BeautifulSoup(html_doc, 'html.parser')
Finally, scrape the desired text:
for s in soup.find_all(id="rhs_block"):
print(s.text)
Notice the ID. This ID is the container where Google puts all the snippet text. In this way, it will literally spit out all the text it finds inside this container, but you can, of course, format it to look a little neater.
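For instance, a minimal sketch of tidying that output, using get_text() with a separator so each text fragment lands on its own line (same id and soup object as above):

# Print each text fragment inside the container on its own line
for s in soup.find_all(id="rhs_block"):
    print(s.get_text(separator="\n", strip=True))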
By the way, if you happen to run into a UnicodeEncodeError, you'll have to append .encode('utf-8') to the end of each text property.
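A minimal sketch of that workaround (note that in Python 3 this prints a bytes object rather than a plain string):

for s in soup.find_all(id="rhs_block"):
    # Encoding sidesteps UnicodeEncodeError on consoles that can't
    # print certain characters
    print(s.text.encode('utf-8'))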
Let me know if you have any more questions. Cheers!
Upvotes: 1