Reputation: 293
I'm working on a project that needs to extract some data from Google Scholar. My PHP program takes a string from my local machine, submits it to Google Scholar as a search query, takes the first result from the results page, and saves it to a database.
I have to do this for almost 90 thousand strings/queries. The problem is that after a few hundred queries the program stops because Google Scholar asks for CAPTCHA verification. What can I do about that?
Upvotes: 0
Views: 1841
Reputation: 1724
You can try the Google Scholar Organic Results API from SerpApi. It's a paid API with a free plan.
It bypasses search-engine blocks using dedicated proxies and a CAPTCHA-solving service, can scale to enterprise volumes, and saves you from writing and maintaining a parser from scratch.
Code example for integrating with PHP (from the online IDE):
<?php
ini_set('display_errors', 1);
ini_set('display_startup_errors', 1);
error_reporting(E_ALL);
require __DIR__ . '/vendor/autoload.php';
$queries = array(
"moon",
"pandas",
"python",
"data science",
"ML",
"AI",
"animals",
"amd",
"nvidia",
"intel",
"asus",
"robbery pi",
"latex, tex",
"amg",
"blizzard",
"world of warcraft",
"cs go",
"antarctica",
"fifa",
"amsterdam",
"usa",
"tesla",
"economy",
"ecology",
"biology"
);
// Create the client once and reuse it for every query.
$client = new GoogleSearch(getenv("API_KEY"));

foreach ($queries as $query) {
    $params = [
        "engine" => "google_scholar",
        "q" => $query,
        "hl" => "en"
    ];
    $response = $client->get_json($params);
    print_r("Extracting search query: {$query}\n");
    foreach ($response->organic_results as $result) {
        print_r("{$result->title}\n");
    }
}
?>
Code example for integrating with Python (from the online IDE):
from serpapi import GoogleScholarSearch
import os
queries = ["moon",
"pandas",
"python",
"data science",
"ML",
"AI",
"animals",
"amd",
"nvidia",
"intel",
"asus",
"robbery pi",
"latex, tex",
"amg",
"blizzard",
"world of warcraft",
"cs go",
"antarctica",
"fifa",
"amsterdam",
"usa",
"tesla",
"economy",
"ecology",
"biology"]
for query in queries:
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google_scholar",
        "q": query,
        "hl": "en"
    }
    search = GoogleScholarSearch(params)
    results = search.get_dict()
    print(f"Extracting search query: {query}")
    for result in results["organic_results"]:
        print(result["title"])
Output:
Extracting search query: moon
Cellulose nanomaterials review: structure, properties and nanocomposites
Reflection in learning and professional development: Theory and practice
...
Extracting search query: biology
A new biology for a new century
The biology of mycorrhiza.
Disclaimer: I work for SerpApi.
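Since the question only needs the first result per query, saved to a database, for roughly 90 thousand queries, here is a minimal sketch of that part. It is an illustration under assumptions, not a tested pipeline: the function names (`first_title`, `save_first_results`), the SQLite table `first_result`, and the `scholar.db` filename are all hypothetical, and `fetch` is a stand-in for whatever call returns a response dict with an `organic_results` list, like the one in the Python example above. Each query is committed as soon as it is stored, so a run that gets interrupted can be restarted without repeating queries that are already saved.

```python
import sqlite3
from typing import Callable, Iterable, Optional


def first_title(results: dict) -> Optional[str]:
    """Return the title of the first organic result, or None if there is none."""
    organic = results.get("organic_results") or []
    return organic[0].get("title") if organic else None


def save_first_results(queries: Iterable[str],
                       fetch: Callable[[str], dict],
                       db_path: str = "scholar.db") -> int:
    """Store the first result per query and return how many rows were added.

    `fetch` is a hypothetical callable that takes a query string and returns
    a response dict. Queries already present in the table are skipped.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS first_result ("
        "query TEXT PRIMARY KEY, title TEXT)"
    )
    saved = 0
    try:
        for query in queries:
            done = conn.execute(
                "SELECT 1 FROM first_result WHERE query = ?", (query,)
            ).fetchone()
            if done:
                continue  # already stored by an earlier run
            title = first_title(fetch(query))
            conn.execute(
                "INSERT INTO first_result (query, title) VALUES (?, ?)",
                (query, title),
            )
            conn.commit()  # commit per query so an interruption loses no work
            saved += 1
    finally:
        conn.close()
    return saved
```

With the Python example above, `fetch` could be something like `lambda q: GoogleScholarSearch({"api_key": os.getenv("API_KEY"), "engine": "google_scholar", "q": q, "hl": "en"}).get_dict()`. Queries that return no results are stored with a NULL title so they are not retried on the next run; whether that is what you want depends on your data.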
Upvotes: 1
Reputation: 1323
Because Google Scholar does not have an API, there is no documented way to do what you want. You are not supposed to scrape data like this, which is why you are running into Google's bot-protection features. I think that your only real option is to wait for Google to create an API.
Upvotes: 1