Reputation: 31
I am working on a project to scrape data from Google Scholar. I want to scrape an author's h-index, total citations, and i10-index (all). For example, from Louisa Gilbert I wish to scrape:
h-index = 36
i10-index = 74
citations = 4383
I have written this:
from bs4 import BeautifulSoup
import urllib.request

url = "https://scholar.google.ca/citations?user=OdQKi7wAAAAJ&hl=en"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
but I am unsure how to continue. (I understand there are some libraries available, but none of them let you scrape the h-index and i10-index.)
Upvotes: 3
Views: 1724
Reputation: 410
To scrape all of the information from a Google Scholar author page, you could use a third-party solution like SerpApi. It's a paid API with a free trial.
Example Python code (client libraries are also available for other languages):
from serpapi import GoogleSearch

params = {
    "api_key": "SECRET_API_KEY",
    "engine": "google_scholar_author",
    "hl": "en",
    "author_id": "-muoO7gAAAAJ"
}

search = GoogleSearch(params)
results = search.get_dict()
Example JSON output:
"cited_by": {
"table": [
{
"citations": {
"all": 7326,
"since_2016": 2613
}
},
{
"h_index": {
"all": 47,
"since_2016": 27
}
},
{
"i10_index": {
"all": 103,
"since_2016": 79
}
}
]
}
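Assuming the response follows the structure shown above (a sketch based on this sample output, not an exhaustive contract), you can pull the three numbers out of the returned dictionary like this:
# the cited_by table is a list of one-key dicts, in the order shown above
table = results["cited_by"]["table"]
citations = table[0]["citations"]["all"]
h_index = table[1]["h_index"]["all"]
i10_index = table[2]["i10_index"]["all"]
print(citations, h_index, i10_index)  # e.g. 7326 47 103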
You can check out the documentation for more details.
Disclaimer: I work at SerpApi.
Upvotes: 0
Reputation: 221
You are almost there. You need to find the HTML elements that contain the data you want to extract. In this particular case, the statistics are inside <td class="gsc_rsb_std"> tags. Pick up these tags from the soup object and then use the string attribute to recover the text from within them:
indexes = soup.find_all("td", "gsc_rsb_std")  # the six cells of the "Cited by" table (All / Since-year pairs)
h_index = indexes[2].string    # h-index (All)
i10_index = indexes[4].string  # i10-index (All)
citations = indexes[0].string  # Citations (All)
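Putting it together with the code from the question, a minimal end-to-end sketch (the User-Agent header is an assumption on my part; Google sometimes rejects requests sent with the default urllib one):
from bs4 import BeautifulSoup
import urllib.request

url = "https://scholar.google.ca/citations?user=OdQKi7wAAAAJ&hl=en"
# pretend to be a browser; the default urllib User-Agent may be blocked
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
page = urllib.request.urlopen(req)
soup = BeautifulSoup(page, "html.parser")

indexes = soup.find_all("td", "gsc_rsb_std")  # All / Since-year cells of the Cited-by table
print("citations =", indexes[0].string)
print("h-index   =", indexes[2].string)
print("i10-index =", indexes[4].string)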
Upvotes: 4