Reputation: 21
I am trying to scrape Google results using BeautifulSoup. The results I get back are not what is displayed on the screen. What is needed to convert the results to the real text I see on the screen?
So far I have only tried printing out the soup, and it looks nothing like the results on the screen.
import requests
from bs4 import BeautifulSoup

search_item = 'site:Facebook.com Dentist gmail.com'
url = "https://www.google.com/search?q=" + search_item
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
print(soup)
I want to be able to parse out the Title, URL, Phone Number, and Email from the Google Results:
4M Dentistry - About | Facebook - Brno
https://www.facebook.com/dentist.brno/about/
Rating: 5 - 25 votes
725 857 346 E-mail [email protected]. Dental Surgery ensures the complete dental care for children and adults. Restorative and aesthetic dentistry, prosthetics ...
Upvotes: 0
Views: 527
Reputation: 1724
This script scrapes the Title, URL, Email, Rating, Votes, and Snippet.
The trickiest part is scraping the emails. To scrape an email address from the Google search results you need to parse the snippet (summary) and then use a regex to find the @gmail address pattern inside that snippet.
To grab emails we can do this (it worked for me):
match_email = re.findall(r'[\w\.-]+@[\w\.-]+', snippet) will find the email addresses, and email = '\n'.join(match_email) will convert them from a list to a string, with each address on a new line (\n).
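For example, run against a made-up snippet (a standalone sketch; the address below is a placeholder, not taken from real results):
import re

# placeholder snippet text with a made-up address, just for illustration
snippet = 'Dental Surgery, e-mail [email protected] for appointments. Restorative and aesthetic dentistry.'

match_email = re.findall(r'[\w\.-]+@[\w\.-]+', snippet)
print(match_email)      # ['[email protected]']

email = '\n'.join(match_email)
print(email)            # each matched address on its own line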
Code and example in online IDE:
import requests, lxml, re, json
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {'q': 'site:Facebook.com Dentist gmail.com'}

# let requests build the query string from params instead of hard-coding "?q="
html = requests.get('https://www.google.com/search',
                    headers=headers,
                    params=params).text
soup = BeautifulSoup(html, 'lxml')

data = []

# each organic result sits in a div with the class "tF2Cxc"
for container in soup.find_all('div', class_='tF2Cxc'):
    title = container.select_one('.DKV0Md').text
    link = container.find('a')['href']

    # rating, votes and price range are not present on every result
    try:
        rating = container.select_one('g-review-stars+ span').text
    except AttributeError:
        rating = None

    try:
        votes = container.select_one('span:nth-child(3)').text
    except AttributeError:
        votes = None

    try:
        price_range = container.select_one('#rso span~ span+ span').text
    except AttributeError:
        price_range = None

    snippet = container.select_one('.lyLwlc span').text

    # pull anything that looks like an email address out of the snippet
    match_email = re.findall(r'[\w\.-]+@[\w\.-]+', snippet)
    email = '\n'.join(match_email)

    data.append({
        'Title': title,
        'Link': link,
        'Email': email,
        'Rating': rating,
        'Votes': votes,
        'Price_range': price_range,
    })

print(json.dumps(data, indent=2, ensure_ascii=False))
Part of the JSON output:
[
  {
    "Title": "LI Dental Group, LLP - Dentist & Dental Office - 452 Photos ...",
    "Link": "https://www.facebook.com/lidentalgroup/about/",
    "Email": "[email protected]",
    "Rating": "Rating: 2.3",
    "Votes": "3 votes"
  },
  {
    "Title": "Perfect Smiles Dental Studio - General Dentist - Santa Clarita ...",
    "Link": "https://www.facebook.com/dentistcanyoncountry/about/",
    "Email": "[email protected]",
    "Rating": null,
    "Votes": null
  },
  {
    "Title": "Forest Dental Center - General Dentist - Lynchburg, Virginia ...",
    "Link": "https://www.facebook.com/ForestDentalCenter/about/",
    "Email": "[email protected]",
    "Rating": "Rating: 4.8",
    "Votes": "203 votes"
  }
]
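Note that class names such as .tF2Cxc, .DKV0Md and .lyLwlc are generated by Google and change from time to time, so the selectors may need updating when Google rolls out a new results layout.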
Alternatively, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches; a completely free trial is currently under development.
Code to integrate using SerpApi:
from serpapi import GoogleSearch
import os, json, re

params = {
    "engine": "google",
    "q": "site:Facebook.com Dentist gmail.com",
    "api_key": os.getenv('API_KEY')
}

search = GoogleSearch(params)
results = search.get_dict()

data = []

for result in results['organic_results']:
    title = result['title']
    link = result['link']
    snippet = result['snippet']

    # same regex as above: pull email addresses out of the snippet text
    match_email = re.findall(r'[\w\.-]+@[\w\.-]+', snippet)
    email = '\n'.join(match_email)

    # rich snippet data (rating, votes, price range) is not present on every result
    try:
        rating = result['rich_snippet']['top']['detected_extensions']['rating']
    except KeyError:
        rating = None

    try:
        votes = result['rich_snippet']['top']['detected_extensions']['votes']
    except KeyError:
        votes = None

    try:
        price_range = result['rich_snippet']['top']['extensions'][2]
    except (KeyError, IndexError):
        price_range = None

    data.append({
        'title': title,
        'link': link,
        'email': email,
        'rating': rating,
        'votes': votes,
        'price_range': price_range,
    })

print(json.dumps(data, indent=2, ensure_ascii=False))
Part of the JSON output:
[
  {
    "title": "LI Dental Group, LLP - Dentist & Dental Office - 452 Photos ...",
    "link": "https://www.facebook.com/lidentalgroup/about/",
    "email": "[email protected]",
    "rating": 2.3,
    "votes": 3,
    "price_range": "Price range: $$"
  },
  {
    "title": "Perfect Smiles Dental Studio - General Dentist - Santa Clarita ...",
    "link": "https://www.facebook.com/dentistcanyoncountry/about/",
    "email": "[email protected]",
    "rating": null,
    "votes": null,
    "price_range": null
  }
]
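The GoogleSearch class comes from SerpApi's Python client, installed with pip install google-search-results; the script above reads the API key from an API_KEY environment variable.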
Disclaimer: I work for SerpApi.
Upvotes: 0
Reputation: 21
I figured it out. I had to convert the results to UTF-8; here is the code I used. Works great!
response = requests.get(url)
soup = BeautifulSoup(response.text,"lxml")
newvar = soup.decode("utf8")
print("doublej = ", newvar)
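If the goal is just readable output, BeautifulSoup's own get_text() and prettify() methods give much the same result; a small sketch:
# visible text only, with tags stripped
print(soup.get_text(" ", strip=True))

# or the full markup, nicely indented
print(soup.prettify())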
Upvotes: 0
Reputation: 8205
Google is known to give you different results if the User-Agent header is missing from the request.
You can add it like this:
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}

response = requests.get(url, headers=headers)
Documentation: Custom Headers
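For completeness, here is how the header fits into the question's script (a minimal sketch; the .tF2Cxc and .DKV0Md selectors come from the answer above and are Google-generated class names that may change):
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}

search_item = 'site:Facebook.com Dentist gmail.com'
response = requests.get('https://www.google.com/search',
                        params={'q': search_item},
                        headers=headers)
soup = BeautifulSoup(response.text, 'lxml')

# print the title and link of each organic result
for container in soup.select('div.tF2Cxc'):
    print(container.select_one('.DKV0Md').text)
    print(container.find('a')['href'])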
Upvotes: 1