Min

Reputation: 337

Configuring Google Custom Search to work like google.search()

I have a relatively large project in which Google searches have returned the best results for filling in our missing values. Using search from the google module in Python gives me exactly the results I need. But when I try to use Custom Search to lift my query limits, the results are not remotely close to what I need. The following code (suggested in Searching in Google with Python) returns exactly what I need, which is the same as what I get when searching on Google's site, but it gets blocked because of too many HTTP requests...

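import http.cookiejar
import urllib.request

from bs4 import BeautifulSoup
from google import search

def google_scrape(url):
    # Open the URL with a cookie-aware opener and return the page title.
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    thepage = opener.open(url)
    soup = BeautifulSoup(thepage, "html.parser")
    return soup.title.text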

i = 1
# queries = ['For. Policy Econ.','Int. J. Soc. For.','BMC Int Health Hum. Rights',
#            'Environ. Health Persp','Environ. Entomol.','Sociol. Rural.','Ecol. Soc.']

search_results = []    
abbrevs_searched = []   
url_results = []  

error_names = []
error = []

# Note: names_to_search is simply a longer version of the commented-out queries list.
for abbreviation in names_to_search:   
    query = abbreviation
    for url in search(query, num=2,stop=1):
        try:
            a = google_scrape(url)
            print(str(i) + ". " + a)
            search_results.append(a)
            abbrevs_searched.append(query)
            url_results.append(url)
            print(url)
            print(" ")
        except Exception as e:
            error_names.append(query)
            error.append(e)  # store the exception itself; the query is already in error_names
            print("\n\n***************"," Exception: ",e)
        i += 1

And I have my Google Custom Search Engine code set up in the following way...

import urllib
from bs4 import BeautifulSoup
import http.cookiejar
from apiclient.discovery import build
"""List of names to search on google"""
names_to_search = set(search_list_1+search_list)
service = build('customsearch', 'v1',developerKey="AIz**********************")
rse = service.cse().list(q="For. Policy Econ.",cx='*******************').execute()
rse

My Google custom search engine is set to search Google.com. For now, all other settings are left at their defaults.

Upvotes: 1

Views: 3338

Answers (1)

ands

Reputation: 2036

As far as I can tell, the problem is not a limitation of the Python module but the fact that Google does not allow scripts to scrape its pages. When I run your program (with the google module) I get HTTP Error 503. That happens because, after too many requests in a short period of time, Google asks for captcha confirmation, and there is no module that can bypass the captcha. An alternative is to use a web search API (for example, the Google Custom Search API), but almost all of these APIs are paid (they usually offer a free tier with low query limits).
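If you stay with scraping for a small number of queries, spacing requests out can at least reduce how often you hit the 503. The sketch below is only an illustration: `call_with_backoff` is a hypothetical helper (not part of the google module), the retry count and delays are arbitrary assumptions, and it will not get you past a captcha once Google shows one.

```python
import time
import urllib.error


def call_with_backoff(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on HTTP 503, wait and retry with a doubling delay."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError as exc:
            # Only 503 is treated as "slow down"; anything else is re-raised,
            # and so is a 503 on the final attempt.
            if exc.code != 503 or attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

With the question's code, `fetch` could be `google_scrape`, for example `call_with_backoff(google_scrape, url)`.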

Web search APIs

Google Custom Search API

The problem with Google Custom Search API is that it was designed to search through your pages.

Google Custom Search enables you to create a search engine for your website, your blog, or a collection of websites. Read more.

UPDATE - May 2020

The next part, on setting up Google Custom Search, has been updated.

(I needed to do Google searches in Python, and Selenium WebDriver wasn't an option. So I decided to use the Google Custom Search API and went back to my SO answer, but it was outdated (because Google changed its developers' interface) and incomplete (it only described how to create a Google Custom Search engine, with no information on how to use it in Python). Because of that I updated my answer, but the old version is still part of this answer further down.)

You can search the entire web with the Google Custom Search API in Python by following these steps:

  1. Create Google Custom Search engine
  2. Edit Google Custom Search engine options
  3. Create Custom Search JSON API key
  4. Use Google Custom Search in Python with google-api-python-client

Creating Google Custom Search engine

To create a Google Custom Search engine, go to the Google Custom Search homepage and click the Add button:

Image - Google Custom Search - Add button

You need to fill out the following info:

  • Sites to search - you can put any URL, for example www.anyurl.com
  • Name of the search engine - you can put any name you want, for example Google

After you have filled out the form, click the Create button:

Image - Google Custom Search - Create form

Editing Google Custom Search engine options

Under Modify your search engine, click the Control Panel button:

Image - Google Custom Search - Congrats page

Under Sites to search (in the Basics tab of the settings), click the Add button:

Image - Google Custom Search - Sites to search

Type in http://www.example.org/, set it to Include just this specific page or URL pattern I have entered, and click Save:

Image - Google Custom Search - New site

After that, select your old website and click the Delete button:

Image - Google Custom Search - Delete old site

Click the OK button to confirm the deletion:

Image - Google Custom Search - Delete confirmation

Under Search the entire web, toggle the ON-OFF button (so that it ends up turned ON):

Image - Google Custom Search - Search the entire web

Creating Custom Search JSON API key

Under Programmatic Access, to the right of Custom Search JSON API, click the Get started button:

Image - Google Custom Search - Link to Custom Search JSON API

You should now be on this page. Under Before you start, and then under Identify your application to Google with API key, to the right of Custom Search Engine (free edition) users, click the Get a Key button:

Image - Custom Search JSON API - Link to New key

Select the project that you want to add the Google Custom Search API to (if you don't already have a Google Cloud project, you can see how to create one here) and click the Next button:

Image - Custom Search JSON API - New key

Click the Done button:

Image - Custom Search JSON API - New key confirmation

Google Custom Search in Python with google-api-python-client

To use the API in Python we need the Search engine ID and the Custom Search JSON API key.

To find the Search engine ID, go to the Google Custom Search homepage and click on the search engine name (Google):

Image - Google Custom Search homepage

Copy the Search engine ID and save it somewhere (we'll need this ID later):

Image - Google Custom Search - Search engine ID

To find the Custom Search JSON API key, go to the Credentials tab of the Google APIs dashboard, copy the API key, and save it somewhere (we'll need this API key as well):

Image - Google APIs dashboard - Credentials tab

Now we need to install google-api-python-client. The easiest way is to use pip (see more information on google-api-python-client here):

pip install google-api-python-client

Finally, you can use Google Custom Search in Python like this (the following example is copied from here):

import pprint
from googleapiclient.discovery import build

service = build('customsearch', 'v1', developerKey='your-API-key') # replace 'your-API-key' with your API key

# q is the search term that you want to search for on google.com
res = service.cse().list(q='search term', cx='search-engine-ID').execute() # replace 'search-engine-ID' with your Search engine ID

pprint.pprint(res)
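The response is a plain dictionary, so pulling out just the titles and URLs takes little code. The sketch below assumes the usual Custom Search JSON API response shape, where hits live under the `items` key and each hit has `title` and `link`. `extract_results` is a hypothetical helper name and `sample` is a hand-written stand-in, not real search output.

```python
def extract_results(response):
    """Return (title, link) pairs from a Custom Search JSON API response."""
    # 'items' is absent when the query has no results, so default to [].
    return [(item.get("title"), item.get("link"))
            for item in response.get("items", [])]


# Hand-written sample shaped like an API response (not real search output).
sample = {"items": [{"title": "Example Domain", "link": "http://www.example.org/"}]}
print(extract_results(sample))
```

With a real response you would call `extract_results(res)` on the result of `.execute()`.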

OLD (part of answer regarding Google Custom Search) - August 2017

Here is the previous explanation of how to search the entire web with Google Custom Search.

(Images in this old part of the answer were replaced with links because they were taking up too much space. Also, the steps to create a Google Custom Search engine (which were copied from Bangkokian's answer into this answer as a quote) have been removed and replaced with a link to Bangkokian's answer, because changes in the Google developers interface have made them outdated.)

First you need to create a Google Custom Search engine.

Bangkokian explained creating a Google Custom Search engine in his answer.

After you have created a Custom Search Engine, go to Google Custom Search and click on the search engine you already have (it will probably be "Google", marked with the red box in the picture below):

Image - Google Custom Search

Now, in the Search Preferences section, select Search the entire web but emphasize included sites (step 7) and then click the Add button:

Image - GCS Preferences section

Type in http://www.example.org/, set it to include only a specific page and click Save:

Image - GCS Adding example.org website

After that select your old website and click Delete:

Image - GCS Deleting old website

Update it to save the changes:

Image - GCS Saving changes

(The following part of the answer, with remarks and notes on Google Custom Search, is still valid.)

Unfortunately, the Google Custom Search API will not provide the same results as searching on the web:

Note that results may not match the results you'd get by searching on Google Web Search. Read more.

However, you can configure your custom search engine to search the whole web. In this case, however, your results are unlikely to match those returned by Google Web Search. Read more.

Also, you can only use the free version:

This article applies only to free basic custom search engines. You can't set Google Site Search to search the entire web. Read more.

And there is a limit of 100 search queries per day:

For CSE users, the API provides 100 search queries per day for free. Read more.
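With a budget that small, it helps to avoid paying twice for the same query and to stop cleanly before the limit. The following is a minimal sketch, not part of google-api-python-client: `QuotaLimitedSearch` and `fake_search` are hypothetical names, and the counter is never reset, so a real script would need to persist or reset it per day.

```python
class QuotaLimitedSearch:
    """Cache results per query and stop before exceeding a query budget."""

    def __init__(self, search_func, daily_limit=100):
        self.search_func = search_func   # callable taking a query string
        self.daily_limit = daily_limit   # 100 matches the free quota quoted above
        self.calls_made = 0
        self.cache = {}

    def search(self, query):
        # Repeated queries are answered from the cache and cost nothing.
        if query in self.cache:
            return self.cache[query]
        if self.calls_made >= self.daily_limit:
            raise RuntimeError("daily query budget exhausted")
        self.calls_made += 1
        self.cache[query] = self.search_func(query)
        return self.cache[query]


# Stand-in for the real API call so the sketch runs offline.
def fake_search(query):
    return {"query": query}


searcher = QuotaLimitedSearch(fake_search, daily_limit=2)
searcher.search("For. Policy Econ.")
searcher.search("For. Policy Econ.")   # served from the cache
searcher.search("Ecol. Soc.")
print(searcher.calls_made)
```

With the real client, `search_func` could be something like `lambda q: service.cse().list(q=q, cx='search-engine-ID').execute()`.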

FAROO API

The only other option is to use the API of another search engine, and it seems that the only free one is the FAROO API.

Edit:

Selenium webdriver

You can use Selenium WebDriver in Python to imitate browser usage. There are options to use Firefox, Chrome, Edge or Safari webdrivers (with Chrome, for example, it actually opens Chrome and does your search), but this is annoying because you don't actually want to see the browser. There is a solution for this: you can use PhantomJS.

PhantomJS is a headless WebKit scriptable with a JavaScript API.

Download it from here. Extract it and see how to use it in the example below (I wrote a simple class that you can use; you just need to change the path to PhantomJS):

import time
from urllib.parse import quote_plus
from selenium import webdriver


class Browser:

    def __init__(self, path, initiate=True, implicit_wait_time = 10, explicit_wait_time = 2):
        self.path = path
        self.implicit_wait_time = implicit_wait_time    # http://www.aptuz.com/blog/selenium-implicit-vs-explicit-waits/
        self.explicit_wait_time = explicit_wait_time    # http://www.aptuz.com/blog/selenium-implicit-vs-explicit-waits/
        if initiate:
            self.start()
        return

    def start(self):
        self.driver = webdriver.PhantomJS(self.path)  # use the stored path, not the global name
        self.driver.implicitly_wait(self.implicit_wait_time)
        return

    def end(self):
        self.driver.quit()
        return

    def go_to_url(self, url, wait_time = None):
        if wait_time is None:
            wait_time = self.explicit_wait_time
        self.driver.get(url)
        print('[*] Fetching results from: {}'.format(url))
        time.sleep(wait_time)
        return

    def get_search_url(self, query, page_num=0, per_page=10, lang='en'):
        query = quote_plus(query)
        url = 'https://www.google.hr/search?q={}&num={}&start={}&hl={}'.format(query, per_page, page_num*per_page, lang)
        return url

    def scrape(self):
        # the XPath might change in the future
        links = self.driver.find_elements_by_xpath("//h3[@class='r']/a[@href]") # searches for all links inside h3 tags with class "r"
        results = []
        for link in links:
            d = {'url': link.get_attribute('href'),
                 'title': link.text}
            results.append(d)
        return results

    def search(self, query, page_num=0, per_page=10, lang='en', wait_time = None):
        if wait_time is None:
            wait_time = self.explicit_wait_time
        url = self.get_search_url(query, page_num, per_page, lang)
        self.go_to_url(url, wait_time)
        results = self.scrape()
        return results




path = '<YOUR PATH TO PHANTOMJS>/phantomjs-2.1.1-windows/bin/phantomjs.exe' ## SET YOUR PATH TO phantomjs
br = Browser(path)
results = br.search('For. Policy Econ.')
for r in results:
    print(r)

br.end()
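Newer Selenium releases (4.x) no longer ship `webdriver.PhantomJS` or the `find_elements_by_xpath` helper, so the class above needs an older Selenium to run as written. A rough equivalent with headless Chrome might look like the sketch below. It assumes Chrome and a matching driver are available. `build_search_url` and `scrape_with_headless_chrome` are hypothetical names, the XPath is the one from the class above and may no longer match Google's markup, and the same captcha limits apply.

```python
from urllib.parse import urlencode


def build_search_url(query, page_num=0, per_page=10, lang="en"):
    # urlencode does the escaping that quote_plus did in the class above;
    # hl is Google's interface-language parameter.
    params = {"q": query, "num": per_page, "start": page_num * per_page, "hl": lang}
    return "https://www.google.com/search?" + urlencode(params)


def scrape_with_headless_chrome(query):
    # Imported lazily so the URL helper works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless")   # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(build_search_url(query))
        # Same XPath as the class above; it may no longer match Google's markup.
        links = driver.find_elements(By.XPATH, "//h3[@class='r']/a[@href]")
        return [{"url": link.get_attribute("href"), "title": link.text}
                for link in links]
    finally:
        driver.quit()
```

Usage would mirror the class: `for r in scrape_with_headless_chrome('For. Policy Econ.'): print(r)`.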

Upvotes: 4
