Reputation: 85
I have a string like this:
url = 'http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10'
I wish to convert it to this:
converted_url = 'https://scholar.google.pl/citations?view_op=search_authors&hl=en&mauthors=label:security&after_author=rukAAOJ8__8J&astart=10'
I have tried this:
converted_url = url.decode('utf-8')
However, this error is thrown:
AttributeError: 'str' object has no attribute 'decode'
Upvotes: 0
Views: 212
Reputation: 1724
You can use requests, which does the decoding automatically for you.
Note: the after_author URL parameter is a next page token, so when you make a request to the exact URL you provided, the returned HTML will not be what you expect, because after_author changes on every request. For example, in my case it is uB8AAEFN__8J, while in your URL it's rukAAOJ8__8J.
To get it to work, you need to parse the next page token from the first page, which leads to the second page, and so on. For example:
# from my other answer:
# https://github.com/dimitryzub/stackoverflow-answers-archive/blob/main/answers/scrape_all_scholar_profiles_bs4.py
import re
import requests
from bs4 import BeautifulSoup

params = {
    "view_op": "search_authors",
    "mauthors": "valve",
    "hl": "pl",
    "astart": 0
}

authors_is_present = True
while authors_is_present:
    # fetch and parse the current page (the original fragment assumed `soup` already existed)
    html = requests.get("https://scholar.google.com/citations", params=params, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    # ... extract profile data here ...

    # if next page is present -> update next page token and increment to the next page
    # if next page is not present -> exit the while loop
    next_button = soup.select_one("button.gs_btnPR")
    if next_button and next_button.get("onclick"):
        params["after_author"] = re.search(r"after_author\\x3d(.*)\\x26", str(next_button["onclick"])).group(1)  # -> XB0HAMS9__8J
        params["astart"] += 10
    else:
        authors_is_present = False
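To illustrate how that regex behaves, here is a standalone run on a made-up onclick value (the \x3d and \x26 escapes appear as literal characters in the page's JavaScript, which is why the pattern doubles the backslashes):
import re

# hypothetical onclick value; the escape sequences are literal characters here
onclick = r"window.location='/citations?view_op\x3dsearch_authors\x26after_author\x3dXB0HAMS9__8J\x26astart\x3d10'"

# '\\x3d' in the raw pattern matches the literal backslash followed by 'x3d'
token = re.search(r"after_author\\x3d(.*)\\x26", onclick).group(1)
print(token)  # -> XB0HAMS9__8J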
Code and example to extract profile data in the online IDE:
from parsel import Selector
import requests, json

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "mauthors": "label:security",  # search query
    "hl": "pl",
    "view_op": "search_authors"
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

html = requests.get("https://scholar.google.pl/citations", params=params, headers=headers, timeout=30)
selector = Selector(html.text)

profiles = []

for profile in selector.css(".gs_ai_chpr"):
    profile_name = profile.css(".gs_ai_name a::text").get()
    profile_link = f'https://scholar.google.com{profile.css(".gs_ai_name a::attr(href)").get()}'
    profile_email = profile.css(".gs_ai_eml::text").get()
    profile_interests = profile.css(".gs_ai_one_int::text").getall()

    profiles.append({
        "profile_name": profile_name,
        "profile_link": profile_link,
        "profile_email": profile_email,
        "profile_interests": profile_interests
    })

print(json.dumps(profiles, indent=2))
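parsel is used here instead of BeautifulSoup because its ::text and ::attr() CSS pseudo-elements extract text and attribute values directly in the selector, which keeps the extraction code short.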
Alternatively, you can achieve the same thing using the Google Scholar Profiles API from SerpApi. It's a paid API with a free plan.
The difference is that you don't need to figure out how to extract the data, bypass blocks from search engines, scale the number of requests, and so on.
Example code to integrate:
from serpapi import GoogleSearch
import os, json

params = {
    "api_key": os.getenv("API_KEY"),      # SerpApi API key
    "engine": "google_scholar_profiles",  # SerpApi profiles parsing engine
    "hl": "pl",                           # language
    "mauthors": "label:security"          # search query
}

search = GoogleSearch(params)
results = search.get_dict()

for profile in results["profiles"]:
    print(json.dumps(profile, indent=2))
# part of the output:
'''
{
"name": "Johnson Thomas",
"link": "https://scholar.google.com/citations?hl=pl&user=eKLr0EgAAAAJ",
"serpapi_link": "https://serpapi.com/search.json?author_id=eKLr0EgAAAAJ&engine=google_scholar_author&hl=pl",
"author_id": "eKLr0EgAAAAJ",
"affiliations": "Professor of Computer Science, Oklahoma State University",
"email": "Zweryfikowany adres z cs.okstate.edu",
"cited_by": 159999,
"interests": [
{
"title": "Security",
"serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=pl&mauthors=label%3Asecurity",
"link": "https://scholar.google.com/citations?hl=pl&view_op=search_authors&mauthors=label:security"
},
{
"title": "cloud computing",
"serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=pl&mauthors=label%3Acloud_computing",
"link": "https://scholar.google.com/citations?hl=pl&view_op=search_authors&mauthors=label:cloud_computing"
},
{
"title": "big data",
"serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=pl&mauthors=label%3Abig_data",
"link": "https://scholar.google.com/citations?hl=pl&view_op=search_authors&mauthors=label:big_data"
}
],
"thumbnail": "https://scholar.google.com/citations/images/avatar_scholar_56.png"
}
'''
Disclaimer: I work for SerpApi.
Upvotes: 1
Reputation: 142631
decode
is used to convert bytes
into string
. And your url is string
, not bytes
.
You can use encode
to convert this string
into bytes
and later use decode
to convert to correct string
.
(I use prefix r
to simulate text with this problem - without prefix url doesn't have to be converted)
url = r'http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10'
print(url)
url = url.encode('utf-8').decode('unicode_escape')
print(url)
result:
http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10
http://scholar.google.pl/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:security&after_author=rukAAOJ8__8J&astart=10
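Note that unicode_escape decodes the bytes as Latin-1, so this round trip can mangle non-ASCII characters; for a plain-ASCII URL like this one it is safe. A minimal sketch of a more targeted alternative, assuming you only want to resolve the literal \xNN escapes:
import re

url = r'http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl'

# replace each literal \xNN escape with the character it encodes
converted_url = re.sub(r'\\x([0-9a-fA-F]{2})', lambda m: chr(int(m.group(1), 16)), url)

print(converted_url)
# http://scholar.google.pl/citations?view_op=search_authors&hl=pl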
BTW: first check print(url); maybe you have the correct url but are using the wrong method to display it. The Python shell displays results entered without print() as if using print(repr()), which shows some characters as escape codes to indicate what encoding is used in the text (utf-8, iso-8859-1, win-1250, latin-1, etc.).
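A minimal example of the difference, using a made-up string with a tab and a control character:
text = 'a\tb\x01c'

print(text)        # prints a real tab; the control character is invisible
print(repr(text))  # 'a\tb\x01c' - the escape codes are visible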
Upvotes: 0