Reputation: 337
I have a relatively large project where searching Google has returned the best results for our missing values. Using the google search module in Python gives me exactly the results I need. When I try to use Custom Search in order to lift my query limits, however, the results returned aren't remotely close to what I need. I have the following code (suggested in Searching in Google with Python) that returns exactly what I need, which is the same thing I get when I search on Google's site, but it gets blocked due to too many HTTP requests...
from google import search
import urllib.request
import http.cookiejar
from bs4 import BeautifulSoup

def google_scrape(url):
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    thepage = opener.open(url)
    soup = BeautifulSoup(thepage, "html.parser")
    return soup.title.text

i = 1
# queries = ['For. Policy Econ.','Int. J. Soc. For.','BMC Int Health Hum. Rights',
#            'Environ. Health Persp','Environ. Entomol.','Sociol. Rural.','Ecol. Soc.']
search_results = []
abbrevs_searched = []
url_results = []
error_names = []
error = []

# Note: names_to_search is simply a longer version of the commented-out queries list.
for abbreviation in names_to_search:
    query = abbreviation
    for url in search(query, num=2, stop=1):
        try:
            a = google_scrape(url)
            print(str(i) + ". " + a)
            search_results.append(a)
            abbrevs_searched.append(query)
            url_results.append(url)
            print(url)
            print(" ")
        except Exception as e:
            error_names.append(query)
            error.append(query)
            print("\n\n***************", " Exception: ", e)
        i += 1
And I have my Google Custom Search Engine code set up in the following way...
import urllib
from bs4 import BeautifulSoup
import http.cookiejar
from apiclient.discovery import build

"""List of names to search on google"""
names_to_search = set(search_list_1 + search_list)

service = build('customsearch', 'v1', developerKey="AIz**********************")
rse = service.cse().list(q="For. Policy Econ.", cx='*******************').execute()
rse
My Google Custom Search Engine is set to search Google.com. As of now, all other settings are at their defaults aside from the site being Google.com.
Upvotes: 1
Views: 3338
Reputation: 2036
As far as I can tell, the problem is not a limitation of the python module, but the fact that Google does not allow its pages to be scraped with scripts. When I run your program (with the google module) I get HTTP Error 503, because after too many requests in a short period of time Google asks you for captcha confirmation, and there is no module that can bypass captcha. An alternative is to use web search APIs (for example, the Google Custom Search API), but almost all of these APIs are paid (they usually offer a free tier with low query limits).
The problem with the Google Custom Search API is that it was designed to search through your own pages.
Google Custom Search enables you to create a search engine for your website, your blog, or a collection of websites. Read more.
UPDATE - May 2020
The following part, on setting up Google Custom Search, has been updated.
(I needed to do Google searches in Python, and Selenium webdriver wasn't an option, so I decided to use the Google Custom Search API and went back to my SO answer. It was outdated (Google changed its developers' interface) and incomplete (it only described how to create a Google Custom Search engine, not how to use it from Python), so I updated it; the old version is still included at the bottom of this answer.)
There is a way to search the entire web with the Google Custom Search API in Python, using the following steps:
Creating Google Custom Search engine
To create a Google Custom Search engine, go to the Google Custom Search homepage and click the Add button:
You need to fill out the following info:
After you have filled out the form, click the Create button:
Editing Google Custom Search engine options
Under Modify your search engine, click the Control Panel button:
Under Sites to search (in the Basics tab of the settings), click the Add button:
Type in http://www.example.org/, set it to Include just this specific page or URL pattern I have entered, and click Save:
After that, select your old website and click the Delete button:
Click the OK button to confirm the deletion:
Under Search the entire web, toggle the ON-OFF button (so that it stays turned ON):
Creating Custom Search JSON API key
Under Programmatic Access, on the right side of Custom Search JSON API, click the Get started button:
You should now be on this page; under Before you start, and then under Identify your application to Google with API key, on the right side of Custom Search Engine (free edition) users, click the Get a Key button:
Select the project that you want to add the Google Custom Search API to (if you don't already have a Google Cloud project, you can see how to create one here) and click the Next button:
Click the Done button:
Google Custom Search in Python with google-api-python-client
To use the API in Python we need a Search engine ID and a Custom Search JSON API key.
To find the Search engine ID, go to the Google Custom Search homepage and click on the search engine's name (Google):
Copy the Search engine ID and save it somewhere (we'll need this ID later):
To find the Custom Search JSON API key, go to the Credentials tab of the Google APIs dashboard, copy the API key, and save it somewhere (we'll need this API key as well):
Now we need to install google-api-python-client; the easiest way is to use pip (see more information on google-api-python-client here):
pip install google-api-python-client
Finally, you can use Google Custom Search in Python like this (the following example is copied from here):
import pprint
from googleapiclient.discovery import build
service = build('customsearch', 'v1', developerKey='your-API-key')  # replace 'your-API-key' with your API key
# q is the search term that you want to search on google.com
res = service.cse().list(q='search term', cx='search-engine-ID').execute() # replace 'search-engine-ID' with your Search engine ID
pprint.pprint(res)
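The response comes back as a plain Python dict; when the query matches something, the results sit under the 'items' key (the documented shape of the Custom Search JSON API response). A minimal sketch of pulling out titles and links:
for item in res.get('items', []):   # 'items' is absent when there are no results
    print(item['title'])            # page title
    print(item['link'])             # result URL
    print(item.get('snippet', ''))  # short text excerpt, when present
    print()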
OLD (part of answer regarding Google Custom Search) - August 2017
Here is the previous explanation of how to search the entire web with Google Custom Search.
(Images in this old part of the answer were replaced with links because they were taking up too much space. Also, the steps for creating a Google Custom Search engine (which had been quoted here from Bangkokian's answer) were removed and replaced with a link to Bangkokian's answer, because changes in the Google developers interface have made them outdated.)
First you need to create a Google Custom Search engine.
Bangkokian explained creating a Google Custom Search engine in his answer.
After you have created a Custom Search Engine, you need to go to Google Custom Search and click on the search engine you already have (it will probably be "Google", marked with the red box in the picture below):
Now, in the Search Preferences section, you need to select Search the entire web but emphasize included sites (step 7) and then click on the add button:
Image - GCS Preferences section
Type in http://www.example.org/, set it to include only a specific page and click Save:
Image - GCS Adding example.org website
After that select your old website and click Delete:
Image - GCS Deleting old website
Update it to save the changes:
(The following part of the answer, with remarks and notes about Google Custom Search, is still valid.)
Unfortunately, Google Custom Search API will not provide the same result as searching on the web:
Note that results may not match the results you'd get by searching on Google Web Search. Read more.
However, you can configure your custom search engine to search the whole web. In this case, however, your results are unlikely to match those returned by Google Web Search. Read more.
Also, you can only use the free version:
This article applies only to free basic custom search engines. You can't set Google Site Search to search the entire web. Read more.
And there is a limit of 100 search queries per day:
For CSE users, the API provides 100 search queries per day for free. Read more.
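Each cse().list() call returns at most 10 results and counts as one query against that quota; a minimal sketch of paging through more results with the documented num and start parameters (reusing service and the placeholder IDs from the example above):
# Fetch the first 30 results, 10 at a time; this consumes 3 of the
# 100 free daily queries. 'start' is 1-based in the Custom Search JSON API.
all_items = []
for start in (1, 11, 21):
    res = service.cse().list(q='search term', cx='search-engine-ID',
                             num=10, start=start).execute()
    all_items.extend(res.get('items', []))
for item in all_items:
    print(item['link'])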
The only other option is to use an API from another search engine, and it seems that the only free one is the FAROO API.
Edit:
You can use selenium webdriver in Python to imitate browser usage. There are options to use Firefox, Chrome, Edge or Safari webdrivers (it actually opens the browser and does your search), but this is annoying because you don't actually want to see the browser. There is a solution for this: you can use PhantomJS.
PhantomJS is a headless WebKit scriptable with a JavaScript API.
Download it from here, extract it, and see how to use it in the example below (I wrote a simple class which you can use; you just need to change the path to PhantomJS):
import time
from urllib.parse import quote_plus
from selenium import webdriver

class Browser:
    def __init__(self, path, initiate=True, implicit_wait_time=10, explicit_wait_time=2):
        self.path = path
        self.implicit_wait_time = implicit_wait_time  # http://www.aptuz.com/blog/selenium-implicit-vs-explicit-waits/
        self.explicit_wait_time = explicit_wait_time  # http://www.aptuz.com/blog/selenium-implicit-vs-explicit-waits/
        if initiate:
            self.start()
        return

    def start(self):
        self.driver = webdriver.PhantomJS(self.path)
        self.driver.implicitly_wait(self.implicit_wait_time)
        return

    def end(self):
        self.driver.quit()
        return

    def go_to_url(self, url, wait_time=None):
        if wait_time is None:
            wait_time = self.explicit_wait_time
        self.driver.get(url)
        print('[*] Fetching results from: {}'.format(url))
        time.sleep(wait_time)
        return

    def get_search_url(self, query, page_num=0, per_page=10, lang='en'):
        query = quote_plus(query)
        # 'hl' is Google's interface-language parameter
        url = 'https://www.google.hr/search?q={}&num={}&start={}&hl={}'.format(query, per_page, page_num * per_page, lang)
        return url

    def scrape(self):
        # xpath might change in the future
        links = self.driver.find_elements_by_xpath("//h3[@class='r']/a[@href]")  # searches for all links inside h3 tags with class "r"
        results = []
        for link in links:
            d = {'url': link.get_attribute('href'),
                 'title': link.text}
            results.append(d)
        return results

    def search(self, query, page_num=0, per_page=10, lang='en', wait_time=None):
        if wait_time is None:
            wait_time = self.explicit_wait_time
        url = self.get_search_url(query, page_num, per_page, lang)
        self.go_to_url(url, wait_time)
        results = self.scrape()
        return results

path = '<YOUR PATH TO PHANTOMJS>/phantomjs-2.1.1-windows/bin/phantomjs.exe'  # SET YOUR PATH TO phantomjs
br = Browser(path)
results = br.search('For. Policy Econ.')
for r in results:
    print(r)
br.end()
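The same no-visible-browser effect can also be had with Chrome's headless mode in Selenium, in case PhantomJS is not an option; a minimal sketch, assuming Chrome and chromedriver are installed and chromedriver is on your PATH:
from selenium import webdriver

# Run Chrome without opening a visible window.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)  # older Selenium releases use chrome_options=options instead
driver.get('https://www.google.com/search?q=For.+Policy+Econ.')
print(driver.title)
driver.quit()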
Upvotes: 4