Reputation: 21
I am trying to scrape Google results using BeautifulSoup. The results I get back are not what is displayed on the screen. What is needed to convert the results to the real text I see on the screen?
So far I have only tried printing out the soup, and it looks nothing like the results on the screen.
import requests
from bs4 import BeautifulSoup

search_item = 'site:Facebook.com Dentist gmail.com'
url = "https://www.google.com/search?q=" + search_item
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
print(soup)
I want to be able to parse out the Title, URL, Phone Number, and Email from the Google Results:
4M Dentistry - About | Facebook - Brno
https://www.facebook.com/dentist.brno/about/
Rating: 5 - 25 votes
725 857 346 E-mail [email protected]. Dental Surgery ensures the complete dental care for children and adults. Restorative and aesthetic dentistry, prosthetics ...
Upvotes: 0
Views: 527
Reputation: 1724
This script scrapes the Title, URL, Email, Rating, Votes, and Snippet.
The trickiest part is scraping the emails. To scrape an email address from the Google search results you need to parse the snippet (summary) and then use a regex to find the @gmail address pattern inside that snippet.
To grab emails we can do this (it worked for me):
match_email = re.findall(r'[\w\.-]+@[\w\.-]+', snippet) will find the email addresses, and email = '\n'.join(match_email) will convert them from a list to a string, with each address on a new line (\n).
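For example, run against a made-up snippet (a standalone sketch; the address below is a placeholder, not taken from real results):
import re

# placeholder snippet text with a made-up address, just for illustration
snippet = 'Dental Surgery, e-mail [email protected] for appointments. Restorative and aesthetic dentistry.'

match_email = re.findall(r'[\w\.-]+@[\w\.-]+', snippet)
print(match_email)      # ['[email protected]']

email = '\n'.join(match_email)
print(email)            # each matched address on its own line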
Code and example in online IDE:
import requests, lxml, re, json
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {'q': 'site:Facebook.com Dentist gmail.com'}

# let requests build the query string from params instead of hard-coding "?q="
html = requests.get('https://www.google.com/search',
                    headers=headers,
                    params=params).text
soup = BeautifulSoup(html, 'lxml')

data = []

# each organic result sits in a div with the class "tF2Cxc"
for container in soup.find_all('div', class_='tF2Cxc'):
    title = container.select_one('.DKV0Md').text
    link = container.find('a')['href']

    # rating, votes and price range are not present on every result
    try:
        rating = container.select_one('g-review-stars+ span').text
    except AttributeError:
        rating = None

    try:
        votes = container.select_one('span:nth-child(3)').text
    except AttributeError:
        votes = None

    try:
        price_range = container.select_one('#rso span~ span+ span').text
    except AttributeError:
        price_range = None

    snippet = container.select_one('.lyLwlc span').text

    # pull anything that looks like an email address out of the snippet
    match_email = re.findall(r'[\w\.-]+@[\w\.-]+', snippet)
    email = '\n'.join(match_email)

    data.append({
        'Title': title,
        'Link': link,
        'Email': email,
        'Rating': rating,
        'Votes': votes,
        'Price_range': price_range,
    })

print(json.dumps(data, indent=2, ensure_ascii=False))
Part of the JSON output:
[
  {
    "Title": "LI Dental Group, LLP - Dentist & Dental Office - 452 Photos ...",
    "Link": "https://www.facebook.com/lidentalgroup/about/",
    "Email": "[email protected]",
    "Rating": "Rating: 2.3",
    "Votes": "3 votes"
  },
  {
    "Title": "Perfect Smiles Dental Studio - General Dentist - Santa Clarita ...",
    "Link": "https://www.facebook.com/dentistcanyoncountry/about/",
    "Email": "[email protected]",
    "Rating": null,
    "Votes": null
  },
  {
    "Title": "Forest Dental Center - General Dentist - Lynchburg, Virginia ...",
    "Link": "https://www.facebook.com/ForestDentalCenter/about/",
    "Email": "[email protected]",
    "Rating": "Rating: 4.8",
    "Votes": "203 votes"
  }
]
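Note that class names such as .tF2Cxc, .DKV0Md and .lyLwlc are generated by Google and change from time to time, so the selectors may need updating when Google rolls out a new results layout.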
Alternatively, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches; a completely free trial is currently under development.
Code to integrate using SerpApi:
from serpapi import GoogleSearch
import os, json, re

params = {
    "engine": "google",
    "q": "site:Facebook.com Dentist gmail.com",
    "api_key": os.getenv('API_KEY')
}

search = GoogleSearch(params)
results = search.get_dict()

data = []

for result in results['organic_results']:
    title = result['title']
    link = result['link']
    snippet = result['snippet']

    # same regex as above: pull email addresses out of the snippet text
    match_email = re.findall(r'[\w\.-]+@[\w\.-]+', snippet)
    email = '\n'.join(match_email)

    # rich snippet data (rating, votes, price range) is not present on every result
    try:
        rating = result['rich_snippet']['top']['detected_extensions']['rating']
    except KeyError:
        rating = None

    try:
        votes = result['rich_snippet']['top']['detected_extensions']['votes']
    except KeyError:
        votes = None

    try:
        price_range = result['rich_snippet']['top']['extensions'][2]
    except (KeyError, IndexError):
        price_range = None

    data.append({
        'title': title,
        'link': link,
        'email': email,
        'rating': rating,
        'votes': votes,
        'price_range': price_range,
    })

print(json.dumps(data, indent=2, ensure_ascii=False))
Part of the JSON output:
[
  {
    "title": "LI Dental Group, LLP - Dentist & Dental Office - 452 Photos ...",
    "link": "https://www.facebook.com/lidentalgroup/about/",
    "email": "[email protected]",
    "rating": 2.3,
    "votes": 3,
    "price_range": "Price range: $$"
  },
  {
    "title": "Perfect Smiles Dental Studio - General Dentist - Santa Clarita ...",
    "link": "https://www.facebook.com/dentistcanyoncountry/about/",
    "email": "[email protected]",
    "rating": null,
    "votes": null,
    "price_range": null
  }
]
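The GoogleSearch class comes from SerpApi's Python client, installed with pip install google-search-results; the script above reads the API key from an API_KEY environment variable.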
Disclaimer: I work for SerpApi.
Upvotes: 0
Reputation: 21
I figured it out. I had to convert the results to UTF-8; here is the code I used. Works great!
response = requests.get(url)
soup = BeautifulSoup(response.text,"lxml")
newvar = soup.decode("utf8")
print("doublej = ", newvar)
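If the goal is just readable output, BeautifulSoup's own get_text() and prettify() methods give much the same result; a small sketch:
# visible text only, with tags stripped
print(soup.get_text(" ", strip=True))

# or the full markup, nicely indented
print(soup.prettify())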
Upvotes: 0
Reputation: 8205
Google is known to give you different results if the User-Agent header is missing from the request.
You can add it like this:
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}

response = requests.get(url, headers=headers)
Documentation: Custom Headers
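For completeness, here is how the header fits into the question's script (a minimal sketch; the .tF2Cxc and .DKV0Md selectors come from the answer above and are Google-generated class names that may change):
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}

search_item = 'site:Facebook.com Dentist gmail.com'
response = requests.get('https://www.google.com/search',
                        params={'q': search_item},
                        headers=headers)
soup = BeautifulSoup(response.text, 'lxml')

# print the title and link of each organic result
for container in soup.select('div.tF2Cxc'):
    print(container.select_one('.DKV0Md').text)
    print(container.find('a')['href'])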
Upvotes: 1