Reputation: 443
I would like to scrape publications from a Google Scholar profile with SimpleHtmlDom.
I have a script for scraping the projects, but the problem is that I can only scrape the projects that are shown on the page.
When I use a URL like this
$html->load_file("http://scholar.google.se/citations?user=Sx4G9YgAAAAJ");
only 20 projects are shown. I can increase the number by changing the URL to
$html->load_file("https://scholar.google.se/citations?user=Sx4G9YgAAAAJ&hl=&view_op=list_works&pagesize=100");
that is, by setting the "pagesize" parameter. But 100 is the maximum number of publications the page will show. Is there a way to scrape all the projects from a profile?
Upvotes: 1
Views: 1250
Reputation: 410
You have to pass an additional pagination parameter in the request URL.
cstart
- Parameter defines the result offset. It skips the given number of results. It's used for pagination. (e.g., 0 (default) is the first page of results, 20 is the 2nd page of results, 40 is the 3rd page of results, etc.).
pagesize
- Parameter defines the number of results to return. (e.g., 20 (default) returns 20 results, 40 returns 40 results, etc.). Maximum number of results to return is 100.
So, your URL should look like this:
https://scholar.google.com/citations?user=WLBAYWAAAAAJ&hl=en&cstart=100&pagesize=100
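As a rough sketch of how the two parameters combine, the URL for any page can be built from the offset and the page size. The function name `scholarPageUrl` is hypothetical and used only for illustration; the parameter names and the 100-result cap come from the description above.

```php
<?php
// Hypothetical helper (not part of any library): builds the profile URL
// for one page of results from the cstart and pagesize parameters.
function scholarPageUrl(string $userId, int $cstart = 0, int $pageSize = 100): string
{
    $params = [
        'user'     => $userId,
        'hl'       => 'en',
        'cstart'   => max(0, $cstart),      // offset: number of results to skip
        'pagesize' => min($pageSize, 100),  // values above 100 are capped at 100
    ];

    // Pass '&' explicitly so the result does not depend on php.ini settings.
    return 'https://scholar.google.com/citations?' . http_build_query($params, '', '&');
}

// Second page of 100 results for the profile used above.
echo scholarPageUrl('WLBAYWAAAAAJ', 100), PHP_EOL;
```

Calling it with `cstart` values of 0, 100, 200 and so on yields one URL per page of 100 results.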
You could also use a third-party solution like SerpApi to do this for you. It's a paid API with a free trial.
Example PHP code (libraries for other languages are also available) to retrieve the second page of results:
require 'path/to/google_search_results';
$query = [
"api_key" => "secret_api_key",
"engine" => "google_scholar_author",
"hl" => "en",
"author_id" => "WLBAYWAAAAAJ",
"num" => "100",
"start" => "100"
];
$search = new GoogleSearch();
$results = $search->json($query);
Example JSON output:
"articles": [
{
"title": "Geographic localization of knowledge spillovers as evidenced by patent citations",
"link": "https://scholar.google.com/citations?view_op=view_citation&hl=en&user=WLBAYWAAAAAJ&cstart=100&pagesize=100&citation_for_view=WLBAYWAAAAAJ:HGTzPopzzJcC",
"citation_id": "WLBAYWAAAAAJ:HGTzPopzzJcC",
"authors": "AB Jaffe, M Trajtenberg, R Henderson",
"publication": "Patents, citations, and innovations: a window on the knowledge economy, 155-178, 2002",
"cited_by": {
"value": 18,
"link": "https://scholar.google.com/scholar?oi=bibs&hl=en&cites=8561816228378857607",
"serpapi_link": "https://serpapi.com/search.json?cites=8561816228378857607&engine=google_scholar&hl=en",
"cites_id": "8561816228378857607"
},
"year": "2002"
},
{
"title": "IPR, innovation, economic growth and development",
"link": "https://scholar.google.com/citations?view_op=view_citation&hl=en&user=WLBAYWAAAAAJ&cstart=100&pagesize=100&citation_for_view=WLBAYWAAAAAJ:70eg2SAEIzsC",
"citation_id": "WLBAYWAAAAAJ:70eg2SAEIzsC",
"authors": "AGZ Hu, AB Jaffe",
"publication": "Department of Economics, National University of Singapore, 2007",
"cited_by": {
"value": 17,
"link": "https://scholar.google.com/scholar?oi=bibs&hl=en&cites=7886734392494692167",
"serpapi_link": "https://serpapi.com/search.json?cites=7886734392494692167&engine=google_scholar&hl=en",
"cites_id": "7886734392494692167"
},
"year": "2007"
},
...
]
Check out the documentation for more details.
Disclaimer: I work at SerpApi.
Upvotes: 0
Reputation: 1326
You cannot get all of the projects at once, but you can get 100 projects at a time, then another 100, and so on. Here is the URL:
https://scholar.google.com/citations?user=Sx4G9YgAAAAJ&hl=&view_op=list_works&cstart=100&pagesize=100
In the above URL, focus on the cstart parameter. Say you have already grabbed the first 100 projects: you now set cstart=100
to grab the next 100, then cstart=200,
and so on until you have all of the publications.
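A minimal sketch of that loop, assuming a page that returns fewer rows than the page size is the last one. The names `collectAllPages` and `$fetchPage` are hypothetical; the fetcher here is a stand-in that slices a fake list of 250 items so the loop can be run without touching the network.

```php
<?php
// Hypothetical helper: keeps requesting pages until one comes back short.
// $fetchPage receives (cstart, pageSize) and must return an array of items.
function collectAllPages(callable $fetchPage, int $pageSize = 100, int $maxPages = 50): array
{
    $all = [];
    for ($page = 0; $page < $maxPages; $page++) {   // $maxPages guards against endless loops
        $items = $fetchPage($page * $pageSize, $pageSize);
        $all = array_merge($all, $items);
        if (count($items) < $pageSize) {
            break;                                  // short (or empty) page: nothing left
        }
    }
    return $all;
}

// Stand-in fetcher over 250 fake publications, used instead of a real request.
$fake = range(1, 250);
$fetchPage = function (int $cstart, int $pageSize) use ($fake): array {
    return array_slice($fake, $cstart, $pageSize);
};

$all = collectAllPages($fetchPage);
echo count($all), PHP_EOL; // 250
```

In a real scraper, `$fetchPage` would load the URL with the given cstart (for example with `$html->load_file(...)` as in the question) and return the publication rows found with `$html->find(...)`, using whatever selector matches the rows on the page. Pausing between requests is also worth considering, since rapid repeated requests may get blocked.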
Hope this helps
Upvotes: 3