Reputation:
I am trying to use the Google Custom Search API. What I want to do is get the first 20 results. I tried changing num=10 in the URL to 20, but that gives a 400 error. How can I fix this, or request the second page of results? (Note: I am searching the entire web.)
Here is the code I am using:
import requests, json

# {YOUR_API_KEY} is a placeholder for a real API key
url = "https://www.googleapis.com/customsearch/v1?q=SmartyKat+Catnip+Cat+Toys&cx=012572433248785697579%3A1mazi7ctlvm&num=10&fields=items(link%2Cpagemap%2Ctitle)&key={YOUR_API_KEY}"
res = requests.get(url)
di = json.loads(res.text)
Upvotes: 4
Views: 6743
Reputation: 1
Just edit the search string: remove all the junk, enter your search string after q=, and follow up with &num=100, e.g.
https://www.google.com/search?q=banana&num=100
This will display 100 banana results.
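If you want to fetch that URL from Python, here is a minimal sketch (note that this is the public results page, not the Custom Search API, and Google may block or CAPTCHA scripted requests):

import requests

# Sketch only: google.com/search is the public results page and may
# block or CAPTCHA automated requests.
res = requests.get("https://www.google.com/search",
                   params={"q": "banana", "num": 100},
                   timeout=30)
print(res.status_code)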
Upvotes: 0
Reputation: 99
You can extract data from Google Search without using the API; the BeautifulSoup web scraping library is enough. Keep in mind that you need to take care of solving CAPTCHAs and IP rate limits, which could be done with rotating proxies and user-agents.
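For example, a minimal, hypothetical sketch of rotating User-Agent strings per request (the strings below are just examples):

import random
import requests

# Hypothetical example: pick a random User-Agent for each request to
# reduce the chance of being rate-limited. The strings are illustrative.
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
]

headers = {"User-Agent": random.choice(user_agents)}
html = requests.get("https://www.google.com/search",
                    params={"q": "SmartyKat Catnip Cat Toys"},
                    headers=headers, timeout=30)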
You can search for elements on a page using CSS selectors. To find CSS selectors you can use the SelectorGadget Chrome extension, which lets you click on the desired element in your browser and returns the corresponding CSS selector (it doesn't always work perfectly if the website is rendered via JavaScript).
It is also possible to dynamically extract results from all available pages using non-token-based pagination; it will go through all of them, no matter how many pages there are. You can add several options for exiting the loop, such as exiting at a page limit or when there is no "next page" button:
if page_num == page_limit:  # exit by page limit
    break
if soup.select_one(".d6cvqb a[id=pnnext]"):  # a next page exists
    params["start"] += 10
else:
    break  # exit on missing "next page" button
Full code with pagination:
from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "SmartyKat Catnip Cat Toys",  # query
    "hl": "en",                        # language
    "gl": "uk",                        # country of the search, UK -> United Kingdom
    "start": 0,                        # offset of the first result, 0 by default
    # "num": 100                       # maximum number of results to return per page
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

page_limit = 5
page_num = 0
data = []

while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        try:
            snippet = result.select_one(".lEBKkf span").text
        except AttributeError:  # no snippet for this result
            snippet = None
        links = result.select_one(".yuRUbf a")["href"]

        data.append({
            "title": title,
            "snippet": snippet,
            "links": links
        })

    if page_num == page_limit:  # exit by page limit
        break
    if soup.select_one(".d6cvqb a[id=pnnext]"):  # a next page exists
        params["start"] += 10
    else:
        break  # exit on missing "next page" button

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
  {
    "title": "SmartyKat Catnip Chase Cat Toy - I Love My Pets",
    "snippet": "Catnip Chase™ compressed catnip toy Play SmartyKat offers a variety of toys to meet a cat's need for hunting, exercise, excitement, interaction, ...",
    "links": "https://www.ilovemypets.ph/index.php?route=product/product&product_id=1670"
  },
  {
    "title": "Kitties & Their Humans - Facebook",
    "snippet": "5 IN STOCK* SmartyKat Catnip Cat Toys Brand: SmartyKat Style: Madcap Mania™ Refillable Assorted Mice Catnip Cat Toy Style: Mice (Random Selection)...",
    "links": "https://m.facebook.com/2674028906242223/"
  },
  ... other results
]
Alternatively, you can use a third-party API such as the Google Search Engine Results API from SerpApi. It's a paid API with a free plan. The difference is that it bypasses blocks (including CAPTCHA) from Google, so there's no need to create and maintain the parser.
Example SerpApi code with pagination:
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json

params = {
    "api_key": "...",                  # serpapi key from https://serpapi.com/manage-api-key
    "engine": "google",                # serpapi parser engine
    "q": "SmartyKat Catnip Cat Toys",  # search query
    "num": "100"                       # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)  # where data extraction happens

organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()  # JSON -> Python dictionary
    page_num += 1

    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet"),
            "link": result.get("link")
        })

    # follow SerpApi's token-based pagination via the "next_link" URL
    if "next_link" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
Output: the same as in the bs4 solution.
Upvotes: 0
Reputation: 169304
The information in the accepted answer https://stackoverflow.com/a/55866268/42346 is accurate.
Below is a Python function I wrote as an extension of the function in the 4th step of this answer https://stackoverflow.com/a/37084643/42346 to return up to 100 results from the Google Search API. It increases the start parameter by 10 for each API call, working out the number of results to request automatically. For example, if you ask for 25 results the function makes 3 API calls: 10 results, 10 results, and 5 results.
Background information:
For instructions on how to set up a Google Custom Search Engine: https://stackoverflow.com/a/37084643/42346
For more detail about how to specify that it search the entire web: https://stackoverflow.com/a/11206266/42346
from googleapiclient.discovery import build
from pprint import pprint as pp
import math

def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    num_search_results = kwargs['num']
    if num_search_results > 100:
        raise NotImplementedError('Google Custom Search API supports max of 100 results')
    elif num_search_results > 10:
        kwargs['num'] = 10  # this cannot be > 10 in the API call
        calls_to_make = math.ceil(num_search_results / 10)
    else:
        calls_to_make = 1

    kwargs['start'] = start_item = 1
    items_to_return = []
    while calls_to_make > 0:
        res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
        items_to_return.extend(res['items'])
        calls_to_make -= 1
        start_item += 10
        kwargs['start'] = start_item
        leftover = num_search_results - start_item + 1
        if 0 < leftover < 10:
            kwargs['num'] = leftover  # request only the remainder on the last call

    return items_to_return
And here's an example of how you'd call that:
NUM_RESULTS = 25
MY_SEARCH = 'why do cats chase their own tails'
MY_API_KEY = 'Google API key'
MY_CSE_ID = 'Custom Search Engine ID'

results = google_search(MY_SEARCH, MY_API_KEY, MY_CSE_ID, num=NUM_RESULTS)
for result in results:
    pp(result)
Upvotes: 4
Reputation: 94
Unfortunately, it is not possible to receive more than 10 results per request from the Google Custom Search API. However, if you want more results you can make multiple calls, increasing the start parameter by 10 each time, as sketched below.
See this link: https://developers.google.com/custom-search/v1/using_rest#query-params
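For illustration, a minimal sketch of paging with the REST endpoint using requests (YOUR_API_KEY and YOUR_CSE_ID are placeholders):

import requests

API_KEY = "YOUR_API_KEY"  # placeholder
CSE_ID = "YOUR_CSE_ID"    # placeholder

items = []
for start in (1, 11):  # first two pages of 10 results each; start is 1-based
    res = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "q": "SmartyKat Catnip Cat Toys",
            "cx": CSE_ID,
            "key": API_KEY,
            "num": 10,       # maximum allowed per request
            "start": start,  # index of the first result to return
        },
        timeout=30,
    )
    res.raise_for_status()
    items.extend(res.json().get("items", []))

print(len(items))  # up to 20 results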
Upvotes: 7