Tushar Bakaya

Reputation: 45

BeautifulSoup can't crawl google search results?

I'm trying to crawl Google search results. This code works well on all the other sites I have tried, but it does not work with Google: it returns an empty list.

from BeautifulSoup import BeautifulSoup
import requests

def googlecrawler(search_term):
    url="https://www.google.co.in/?gfe_rd=cr&ei=UVSeVZazLozC8gfU3oD4DQ&gws_rd=ssl#q="+search_term
    junk_code=requests.get(url)
    ok_code=junk_code.text
    good_code=BeautifulSoup(ok_code)
    best_code=good_code.findAll('h3',{'class':'r'})
    print best_code


googlecrawler("healthkart") 

It should return something like this.

<h3 class="r"><a href="/url?  sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=6&amp;cad=rja&amp;uact=8&amp;ved=0CEIQFjAF&amp;url=http%3A%2F%2Fwww.coupondunia.in%2Fhealthkart&amp;ei=qFmfVc2fFNO0uASti4PwDQ&amp;usg=AFQjCNFHMzqn-rH4Hp-fZK0E4wwxJmevEg&amp;sig2=QgwxMBdbPndyQTSH10dV2Q" onmousedown="return rwt(this,'','','','6','AFQjCNFHMzqn-rH4Hp-fZK0E4wwxJmevEg','QgwxMBdbPndyQTSH10dV2Q','0CEIQFjAF','','',event)" data-href="http://www.coupondunia.in/healthkart">HealthKart Coupons: July 2015 Coupon Codes</a></h3>

Upvotes: 0

Views: 1897

Answers (2)

Dmitriy Zub

Reputation: 1724

One common solution is to set a user-agent and pass it in the request headers so the request looks like a real browser visit:

# https://www.whatismybrowser.com/guides/the-latest-user-agent/
headers = {
  "User-Agent":
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

So your code will look like this:

import requests  # the 'lxml' parser used below must be installed (pip install lxml)
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

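def googlecrawler(search_term):
    # Pass the search term as the q parameter; requests URL-encodes it.
    html = requests.get('https://www.google.com/search', params={'q': search_term}, headers=headers).text
    soup = BeautifulSoup(html, 'lxml')

    # These class names reflect Google's markup when this was written and may change.
    for container in soup.find_all('div', class_='tF2Cxc'):
        title = container.select_one('.DKV0Md').text
        link = container.find('a')['href']
        print(f'{title}\n{link}')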

googlecrawler('site:Facebook.com Dentist gmail.com')

# part of the output:
'''
COVID-19 Office Update Dear... - Canton Dental Associates ...
https://www.facebook.com/permalink.php?id=107567882605441&story_fbid=3205134459515419

Spinelli Dental - General Dentist - Rochester, New York ...
https://www.facebook.com/spinellidental/about/?referrer=services_landing_page

LaboSmile USA Dentist & Dental Office in Delray ... - Facebook
https://www.facebook.com/labosmileusa/
'''
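
The search term in the example above contains spaces and a colon, so it has to be URL-encoded before it goes into the query string. The sketch below shows that encoding using only the standard library. `build_search_url` is a hypothetical helper name, not part of the code above.

```python
from urllib.parse import urlencode


def build_search_url(search_term, base="https://www.google.com/search"):
    """Return the search URL with the term URL-encoded in the q parameter."""
    return f"{base}?{urlencode({'q': search_term})}"


url = build_search_url("site:Facebook.com Dentist gmail.com")
print(url)
# https://www.google.com/search?q=site%3AFacebook.com+Dentist+gmail.com
```

Passing `params={'q': search_term}` to `requests.get` performs the same encoding for you.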

Alternatively, you can do it by using Google Search Engine Results API from SerpApi. It's a paid API with a free plan.

Code to integrate:

from serpapi import GoogleSearch
import os

params = {
  "engine": "google",
  "q": "site:Facebook.com Dentist gmail.com",
  "api_key": os.getenv('API_KEY')
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
  title = result['title']
  link = result['link']
  print(f'{title}\n{link}\n')

# part of the output:
'''
Green Valley Dental - About | Facebook
https://www.facebook.com/GVDFamily/about/

My Rivertown Dentist - About | Facebook
https://www.facebook.com/Rivertownfamily/about/

COVID-19 Office Update Dear... - Canton Dental Associates ...
https://www.facebook.com/permalink.php?id=107567882605441&story_fbid=3205134459515419
'''

Disclaimer: I work for SerpApi.

Upvotes: 0

user3636636

Reputation: 2499

Looking at good_code, I can't see an h3 or a class "r" at all. That is why your code returns an empty list.
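
A quick way to confirm this is to count the matching tags in the HTML that came back before parsing any further. The sketch below uses only the standard library. `TagCounter` and `count_tags` are hypothetical names, and `sample` is a made-up string standing in for the text of the real response.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags that match a tag name and a CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1


def count_tags(html, tag, css_class):
    counter = TagCounter(tag, css_class)
    counter.feed(html)
    return counter.count


# Made-up sample markup, not a real Google response.
sample = '<div><h3 class="r"><a href="#">Result</a></h3><h3>Other</h3></div>'
print(count_tags(sample, "h3", "r"))                        # 1
print(count_tags("<html><body></body></html>", "h3", "r"))  # 0
```

A count of 0 on the real response means the selector has nothing to match, whatever the parsing code does afterwards.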

There is no problem with your code as such; the elements you are searching for are simply not in the HTML that comes back.
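
One likely reason, judging by the URL in the question, is that the search term sits after a `#`. Everything after `#` is a URL fragment, which the client keeps to itself and does not send to the server, so the server never receives the search term. The sketch below splits the question's URL with the standard library to show where the term ends up.

```python
from urllib.parse import urlsplit, parse_qs

# The URL from the question, with the search term appended after '#q='.
url = ("https://www.google.co.in/?gfe_rd=cr&ei=UVSeVZazLozC8gfU3oD4DQ"
       "&gws_rd=ssl#q=healthkart")
parts = urlsplit(url)

print(parts.query)                   # gfe_rd=cr&ei=UVSeVZazLozC8gfU3oD4DQ&gws_rd=ssl
print(parts.fragment)                # q=healthkart
print("q" in parse_qs(parts.query))  # False: the term is not in the query string
```

Putting the term in the query string instead (`/search?q=...`) makes it part of the request.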

What were you expecting to return?

Upvotes: 0
