Reputation: 2604
I would like to get all the authors' names from Google Scholar. My base URL is http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security so basically, I am looking for authors who wrote anything about security.
I wrote a Python script using BeautifulSoup, but (I don't know why) it prints an empty list, as if it did not find any of the given elements (however, when I look at the page source, I do see <div class="gsc_1usr_text"> elements there).
Here's my code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup
url = "http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security"
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
mydivs = soup.findAll("div", { "class" : "gsc_1usr_text" })
print mydivs
The output is [], and print "LEN = " + str(len(mydivs)) shows 0.
I'm using Python 2.7.3 on Linux Mint 13.
Upvotes: 0
Views: 1462
Reputation: 1724
You might be sending too many requests, or Google may have detected your script as automated.
The first thing you can try is adding proxies to your request:
# https://docs.python-requests.org/en/master/user/advanced/#proxies
import os

proxies = {
    'http': os.getenv('HTTP_PROXY')  # Or type your proxy URL here instead of using os.getenv()
}
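The dictionary above only takes effect once it is passed to the request call. A minimal sketch of one way to build it, assuming the proxy URLs live in the HTTP_PROXY and HTTPS_PROXY environment variables; `build_proxies` is a hypothetical helper name and the proxy address is a placeholder, neither comes from the original answer:

```python
import os


def build_proxies(env=None):
    """Build a requests-style proxies dict from environment variables.

    `build_proxies` is a hypothetical helper, not part of any library.
    Schemes whose variable is unset are left out, so the dict never
    contains None values.
    """
    env = os.environ if env is None else env
    candidates = {
        "http": env.get("HTTP_PROXY"),
        "https": env.get("HTTPS_PROXY"),
    }
    return {scheme: proxy for scheme, proxy in candidates.items() if proxy}


# Placeholder proxy address, used here only for illustration.
print(build_proxies({"HTTP_PROXY": "http://proxy.example:3128"}))
# prints {'http': 'http://proxy.example:3128'}
```

With the requests library, the result would then be passed as `requests.get(url, proxies=proxies)`.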
Or you can make it work by using either requests-html or selenium to render the whole HTML page without using proxies, but you can still get a CAPTCHA.
Code to make it work (I tested it locally):
# If you get an empty array, you got a CAPTCHA from Google.
# Print the response to see what caused it.
# Note: code below doesn't do pagination. https://requests-html.kennethreitz.org/#pagination
from requests_html import HTMLSession

session = HTMLSession()
url = 'https://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security'
response = session.get(url)

# https://requests-html.kennethreitz.org/#requests_html.HTML.render
response.html.render(sleep=1)

for author_name in response.html.find('.gs_ai_name'):
    name = author_name.text
    print(name)
Output:
Johnson Thomas
Martin Abadi
Adrian Perrig
Vern Paxson
Frans Kaashoek
Mihir Bellare
Matei Zaharia
Helen J. Wang
Zhu Han
Sushil Jajodia
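When the loop prints nothing, it helps to check what page actually came back before changing selectors. The sketch below is one possible check, not part of the tested code above: it counts elements carrying the expected class and reports the page title. It uses the standard library's html.parser so it runs without extra packages; `PageProbe`, `probe`, and both sample pages are hypothetical and only illustrate the idea.

```python
from html.parser import HTMLParser


class PageProbe(HTMLParser):
    """Count elements carrying a given class and capture the page <title>."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.matches = 0
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing or valueless, hence the `or ""`.
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted_class in classes:
            self.matches += 1
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def probe(html, wanted_class):
    """Return (number of matching elements, page title) for an HTML string."""
    parser = PageProbe(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.matches, parser.title.strip()


# Hypothetical pages: one with a result, one without (e.g. a block page).
results_page = ('<html><head><title>Profiles</title></head><body>'
                '<h3 class="gs_ai_name"><a href="#">Adrian Perrig</a></h3>'
                '</body></html>')
blocked_page = ('<html><head><title>Blocked</title></head><body>'
                '<p>Please verify.</p></body></html>')

print(probe(results_page, "gs_ai_name"))  # prints (1, 'Profiles')
print(probe(blocked_page, "gs_ai_name"))  # prints (0, 'Blocked')
```

A count of zero together with an unexpected title suggests a CAPTCHA or block page rather than a wrong selector.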
Alternatively, you can use the Google Scholar Profiles API from SerpApi. It's a paid API with a trial of 5,000 searches. A completely free trial is currently under development.
The main difference is that you don't have to solve CAPTCHAs, wait for slow page rendering, or strain your PC with multiple browser instances (e.g. when using selenium).
Code to integrate:
from serpapi import GoogleSearch

params = {
    "engine": "google_scholar_profiles",
    "hl": "en",
    "mauthors": "label:security",
    "api_key": "YOUR_API_KEY"
}

search = GoogleSearch(params)
results = search.get_dict()

for author_name in results['profiles']:
    name = author_name['name']
    print(name)
Output:
Johnson Thomas
Martin Abadi
Adrian Perrig
Vern Paxson
Frans Kaashoek
Mihir Bellare
Matei Zaharia
Helen J. Wang
Zhu Han
Sushil Jajodia
Part of the JSON output:
"profiles": [
  {
    "name": "Johnson Thomas",
    "link": "https://scholar.google.com/citations?hl=en&user=eKLr0EgAAAAJ",
    "serpapi_link": "https://serpapi.com/search.json?author_id=eKLr0EgAAAAJ&engine=google_scholar_author&hl=en",
    "author_id": "eKLr0EgAAAAJ",
    "affiliations": "Professor of Computer Science, Oklahoma State University",
    "email": "Verified email at cs.okstate.edu",
    "cited_by": 150263,
    "interests": [
      {
        "title": "Security",
        "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Asecurity",
        "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:security"
      }
    ]
  }
]
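The same dictionary carries more than the name. As a rough sketch, assuming every profile follows the shape of the excerpt above (the `.get` calls are an assumption that some fields may be absent), the other fields could be collected like this; `summarize_profiles` is a hypothetical helper and the sample is trimmed from the JSON shown:

```python
import json

# Sample trimmed from the JSON excerpt above; a real response has more entries.
sample = json.loads("""
{
  "profiles": [
    {
      "name": "Johnson Thomas",
      "author_id": "eKLr0EgAAAAJ",
      "affiliations": "Professor of Computer Science, Oklahoma State University",
      "cited_by": 150263,
      "interests": [{"title": "Security"}]
    }
  ]
}
""")


def summarize_profiles(results):
    """Return (name, author_id, cited_by, [interest titles]) per profile."""
    rows = []
    for profile in results.get("profiles", []):
        interests = [item["title"] for item in profile.get("interests", [])]
        rows.append((
            profile["name"],
            profile.get("author_id"),
            profile.get("cited_by"),
            interests,
        ))
    return rows


for row in summarize_profiles(sample):
    print(row)
# prints ('Johnson Thomas', 'eKLr0EgAAAAJ', 150263, ['Security'])
```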
Disclaimer: I work for SerpApi.
Upvotes: 0
Reputation: 174706
Your code works for me.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup
url = "http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security"
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
mydivs = soup.findAll("div", { "class" : "gsc_1usr_text" })
print mydivs
Output:
[<div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=n-Oret4AAAAJ&hl=pl&oe=Latin2">Adrian Perrig</a></h3><div class="gsc_1usr_aff">Professor of Computer Science at ETH Zürich, Adjunct Professor of ECE and EPP at CMU</div><div class="gsc_1usr_eml">Zweryfikowany adres z inf.ethz.ch</div><div class="gsc_1usr_emlb">@inf.ethz.ch</div><div class="gsc_1usr_cby">Cytowane przez 40938</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:security">Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:networking">Networking</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:operating_systems">Operating Systems</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:computer_security">Computer Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:internet_security">Internet Security</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=HvwPRJ0AAAAJ&hl=pl&oe=Latin2">Vern Paxson</a></h3><div class="gsc_1usr_aff">Professor, EECS, University of California, Berkeley</div><div class="gsc_1usr_eml">Zweryfikowany adres z berkeley.edu</div><div class="gsc_1usr_emlb">@berkeley.edu</div><div class="gsc_1usr_cby">Cytowane przez 39914</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:networking">Networking</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:security">Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:measurement">Measurement</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=2pW1g5IAAAAJ&hl=pl&oe=Latin2">Mihir Bellare</a></h3><div 
class="gsc_1usr_aff">Professor, Department of Computer Science and Engineering, UCSD</div><div class="gsc_1usr_eml">Zweryfikowany adres z eng.ucsd.edu</div><div class="gsc_1usr_emlb">@eng.ucsd.edu</div><div class="gsc_1usr_cby">Cytowane przez 35459</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:cryptography">Cryptography</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:security">Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:complexity_theory">Complexity Theory</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=FCsdj0YAAAAJ&hl=pl&oe=Latin2">Wenyuan Xu</a></h3><div class="gsc_1usr_aff">Assistant Profess of Department of Computer Science and Engineering, University of South …</div><div class="gsc_1usr_cby">Cytowane przez 32521</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:wireless_networks">Wireless Networks</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:security">Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:jamming_defenses">jamming defenses</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:dependable_systems">dependable systems</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=vWTI60AAAAAJ&hl=pl&oe=Latin2">Martin Abadi</a></h3><div class="gsc_1usr_aff">Principal Scientist, Google, and Professor Emeritus, UC Santa Cruz</div><div class="gsc_1usr_eml">Zweryfikowany adres z cs.ucsc.edu</div><div class="gsc_1usr_emlb">@cs.ucsc.edu</div><div class="gsc_1usr_cby">Cytowane przez 29938</div><div class="gsc_1usr_int"><a class="gsc_co_int" 
href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:security">security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:programming_languages_and_systems">programming languages and systems</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:specification_and_verification">specification and verification</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=lOZ1vHIAAAAJ&hl=pl&oe=Latin2">Sushil Jajodia</a></h3><div class="gsc_1usr_aff">University Professor, BDM International Professor, and Director, Center for Secure …</div><div class="gsc_1usr_eml">Zweryfikowany adres z gmu.edu</div><div class="gsc_1usr_emlb">@gmu.edu</div><div class="gsc_1usr_cby">Cytowane przez 29705</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:security">security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:privacy">privacy</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:database">database</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:databases">databases</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:distributed_systems">distributed systems</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=Z_enRVYAAAAJ&hl=pl&oe=Latin2">Xiaolan Zhang</a></h3><div class="gsc_1usr_aff">IBM</div><div class="gsc_1usr_eml">Zweryfikowany adres z us.ibm.com</div><div class="gsc_1usr_emlb">@us.ibm.com</div><div class="gsc_1usr_cby">Cytowane przez 27321</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:virtualization">Virtualization</a> <a class="gsc_co_int" 
href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:security">Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:systems">Systems</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=W7YBLlEAAAAJ&hl=pl&oe=Latin2">Jean-Pierre Hubaux</a></h3><div class="gsc_1usr_aff">Professor, EPFL</div><div class="gsc_1usr_eml">Zweryfikowany adres z epfl.ch</div><div class="gsc_1usr_emlb">@epfl.ch</div><div class="gsc_1usr_cby">Cytowane przez 24738</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:privacy">Privacy</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:security">Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:networking">Networking</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=WgyDcoUAAAAJ&hl=pl&oe=Latin2">Ross Anderson</a></h3><div class="gsc_1usr_aff">University of Cambridge</div><div class="gsc_1usr_eml">Zweryfikowany adres z cl.cam.ac.uk</div><div class="gsc_1usr_emlb">@cl.cam.ac.uk</div><div class="gsc_1usr_cby">Cytowane przez 24445</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:security">Security</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:cryptology">cryptology</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:dependability">dependability</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:technology_policy">technology policy</a> </div></div>, <div class="gsc_1usr_text"><h3 class="gsc_1usr_name"><a href="/citations?user=lsKlsJ8AAAAJ&hl=pl&oe=Latin2">Heejo Lee</a></h3><div class="gsc_1usr_aff">Professor of 
Computer Science, Korea University</div><div class="gsc_1usr_eml">Zweryfikowany adres z korea.ac.kr</div><div class="gsc_1usr_emlb">@korea.ac.kr</div><div class="gsc_1usr_cby">Cytowane przez 23596</div><div class="gsc_1usr_int"><a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:network">network</a> <a class="gsc_co_int" href="/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:security">security</a> </div></div>]
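Since the question asks only for the names, the list above still needs one more step. A minimal sketch, assuming the markup matches the output shown (each name sits in an h3 with class gsc_1usr_name). It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages; `AuthorNameParser` and `extract_names` are hypothetical names, and the sample is shortened from the output above:

```python
from html.parser import HTMLParser


class AuthorNameParser(HTMLParser):
    """Collect the text of every <h3 class="gsc_1usr_name"> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h3" and "gsc_1usr_name" in classes:
            self._in_name = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_name:
            self.names.append("".join(self._buffer).strip())
            self._in_name = False

    def handle_data(self, data):
        if self._in_name:
            self._buffer.append(data)


def extract_names(html):
    """Return the author names found in an HTML string."""
    parser = AuthorNameParser()
    parser.feed(html)
    parser.close()
    return parser.names


# Shortened from the output above (two of the ten entries).
sample = (
    '<div class="gsc_1usr_text"><h3 class="gsc_1usr_name">'
    '<a href="/citations?user=n-Oret4AAAAJ&hl=pl&oe=Latin2">Adrian Perrig</a></h3>'
    '<div class="gsc_1usr_cby">Cytowane przez 40938</div></div>'
    '<div class="gsc_1usr_text"><h3 class="gsc_1usr_name">'
    '<a href="/citations?user=HvwPRJ0AAAAJ&hl=pl&oe=Latin2">Vern Paxson</a></h3>'
    '<div class="gsc_1usr_cby">Cytowane przez 39914</div></div>'
)

print(extract_names(sample))  # prints ['Adrian Perrig', 'Vern Paxson']
```

With BeautifulSoup already in use, the equivalent would be to call `find_all("h3", class_="gsc_1usr_name")` on the soup and read `.text` from each result.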
Upvotes: 1