user7340115

Reputation: 31

Scrape an author's h-index, i10-index and total citations from Google Scholar

I am working on a project to scrape data from Google Scholar. I want to scrape an author's h-index, total citations and i10-index (the "All" values). For example, for Louisa Gilbert I wish to scrape:

h-index = 36
i10-index = 74
citations = 4383

I have written this:

from bs4 import BeautifulSoup
import urllib.request
url="https://scholar.google.ca/citations?user=OdQKi7wAAAAJ&hl=en"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser') 

but I am unsure how to continue. (I understand there are some libraries available, but none of them let you scrape the h-index or i10-index.)

Upvotes: 3

Views: 1724

Answers (2)

Milos Djurdjevic

Reputation: 410

To scrape all of the information from a Google Scholar author page, you could use a third-party solution such as SerpApi. It is a paid API with a free trial.

Example Python code (libraries for other languages are also available):

from serpapi import GoogleSearch

params = {
  "api_key": "SECRET_API_KEY",
  "engine": "google_scholar_author",
  "hl": "en",
  "author_id": "-muoO7gAAAAJ"
}

search = GoogleSearch(params)
results = search.get_dict()

Example JSON output:

"cited_by": {
  "table": [
    {
      "citations": {
        "all": 7326,
        "since_2016": 2613
      }
    },
    {
      "h_index": {
        "all": 47,
        "since_2016": 27
      }
    },
    {
      "i10_index": {
        "all": 103,
        "since_2016": 79
      }
    }
  ]
}
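As a rough sketch of how the three "all" values could be pulled out of that structure: the snippet below parses the sample fragment above (wrapped in braces so it is valid JSON) rather than calling the API, and `all_time_metrics` is a hypothetical helper name, not part of the SerpApi library.

```python
import json

# Sample fragment copied from the JSON output above, wrapped in braces
# so that it parses as a standalone JSON document.
sample = json.loads("""
{
  "cited_by": {
    "table": [
      {"citations": {"all": 7326, "since_2016": 2613}},
      {"h_index": {"all": 47, "since_2016": 27}},
      {"i10_index": {"all": 103, "since_2016": 79}}
    ]
  }
}
""")


def all_time_metrics(results):
    """Hypothetical helper: flatten the list of one-key dicts
    into a single {metric_name: all-time value} mapping."""
    metrics = {}
    for row in results.get("cited_by", {}).get("table", []):
        for name, values in row.items():
            metrics[name] = values.get("all")
    return metrics


metrics = all_time_metrics(sample)
print(metrics["h_index"], metrics["i10_index"], metrics["citations"])
```

With a real response, the same helper would presumably be called on the dictionary returned by `search.get_dict()`, assuming it has the shape shown in the sample.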

You can check out the documentation for more details.

Disclaimer: I work at SerpApi.

Upvotes: 0

Carlos Peña

Reputation: 221

You are almost there. You need to find the HTML elements that contain the data you want to extract. In this particular case, the index values sit in <td class="gsc_rsb_std"> tags. Pick these tags out of the soup object, then use each tag's string attribute to recover the text inside it:

# The cells come back row by row, two per row ("All" first, then the recent-years column),
# so the "All" values are at the even positions 0, 2 and 4.
indexes = soup.find_all("td", "gsc_rsb_std")
citations = indexes[0].string
h_index = indexes[2].string
i10_index = indexes[4].string
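To try the selection logic without hitting Google Scholar, here is a minimal offline sketch. The HTML is a simplified, assumed stand-in for the citation table, not the live page's markup: only the `gsc_rsb_std` class comes from this answer, the "All" values are the ones from the question, and the second-column values are made-up placeholders.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the citation table (assumed layout).
# Second-column numbers are placeholders, not real data.
html = """
<table>
  <tr><td>Citations</td><td class="gsc_rsb_std">4383</td><td class="gsc_rsb_std">100</td></tr>
  <tr><td>h-index</td><td class="gsc_rsb_std">36</td><td class="gsc_rsb_std">10</td></tr>
  <tr><td>i10-index</td><td class="gsc_rsb_std">74</td><td class="gsc_rsb_std">20</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the text of every value cell, in document order.
cells = [td.get_text(strip=True) for td in soup.find_all("td", class_="gsc_rsb_std")]

# Even positions hold the "All" column: citations, h-index, i10-index.
citations, h_index, i10_index = (int(cells[i]) for i in (0, 2, 4))

print(f"h-index = {h_index}")
print(f"i10-index = {i10_index}")
print(f"citations = {citations}")
```

Here `get_text(strip=True)` is used instead of `.string` because `.string` returns `None` when a tag has more than one child, and converting to `int` makes the values usable for arithmetic. Both are design choices of this sketch rather than requirements.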

Upvotes: 4
