ayush singhal

Reputation: 1939

Extract text from Google Scholar

I am trying to extract the text snippet that Google Scholar shows for a particular query. By text snippet I mean the text below the title (in black letters). Currently I am trying to extract it from the HTML file using Python, but it contains a lot of extra text such as

/div><div class="gs_fl"...etc.

Is there an easy way, or some code, that can help me get the text without this redundant markup?

Upvotes: 0

Views: 1024

Answers (2)

Dmitriy Zub

Reputation: 1724

An old question, but it might still be relevant. Use SelectorGadget to grab CSS selectors easily. Make sure you're using a proxy, otherwise Google might block the request, even if you make it via Selenium.

Code:

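from bs4 import BeautifulSoup
import os
import requests  # lxml must also be installed: BeautifulSoup uses it as the parser below

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# The Scholar URL is HTTPS, so requests looks up the 'https' key;
# an 'http'-only mapping would not be applied to this request.
# HTTPS_PROXY is assumed to be set alongside HTTP_PROXY.
proxies = {
  'http': os.getenv('HTTP_PROXY'),
  'https': os.getenv('HTTPS_PROXY'),
}

html = requests.get('https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=samsung&oq=', headers=headers, proxies=proxies).text
soup = BeautifulSoup(html, 'lxml')

# Each organic result sits in a .gs_ri container; the snippet is in .gs_rs
for result in soup.select('.gs_ri'):
  snippet_tag = result.select_one('.gs_rs')
  if snippet_tag is None:  # skip results that have no snippet element
    continue
  print(f"Snippet: {snippet_tag.text}")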

Part of the output:

Snippet: Purpose–Extensive research has shown that country‐of‐origin (COO) information significantly affects product evaluations and buying behavior. Yet recently, a competing perspective has emerged suggesting that COO effects have been inflated in prior research …

Alternatively, you can use the Google Scholar Organic Search Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.

Essentially, it does the same thing as the script above, except you don't need to think about how to solve CAPTCHAs or find good proxies.

Code to integrate:

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google_scholar",
  "q": "samsung",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
  print(f"Snippet: {result['snippet']}")

Part of the output:

Snippet: Purpose–Extensive research has shown that country‐of‐origin (COO) information significantly affects product evaluations and buying behavior. Yet recently, a competing perspective has emerged suggesting that COO effects have been inflated in prior research …
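Some results may come back without a snippet field (an assumption here, not something stated above), in which case result['snippet'] would raise a KeyError. A minimal defensive sketch, using a hand-made sample_results dict in place of a real API response; collect_snippets and sample_results are illustrative names only:

```python
def collect_snippets(results):
    """Return the snippet of every organic result that has one."""
    snippets = []
    # .get() avoids a KeyError when a key is missing from the response
    for result in results.get("organic_results", []):
        snippet = result.get("snippet")
        if snippet:
            snippets.append(snippet)
    return snippets


# Hypothetical stand-in for the dict returned by search.get_dict()
sample_results = {
    "organic_results": [
        {"title": "Paper A", "snippet": "First snippet …"},
        {"title": "Paper B"},  # no snippet: skipped instead of crashing
    ]
}

for snippet in collect_snippets(sample_results):
    print(f"Snippet: {snippet}")
```

With a real response you would pass results from the code above instead of sample_results.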

Disclaimer: I work for SerpApi.

Upvotes: 0

twneale

Reputation: 2946

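You need an HTML parser:

import lxml.html

# html is the page source as a string
doc = lxml.html.fromstring(html)
# xpath() returns a list of elements, so call text_content() on each one.
# "gs_rs" is the snippet div; "gs_fl" holds the footer links under each result.
texts = [div.text_content() for div in doc.xpath('//div[@class="gs_rs"]')]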

You can install lxml with "pip install lxml", but you may need to build its dependencies, and the details depend on your platform.
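If building lxml is a hassle, the standard library's html.parser can do a similar job. A minimal sketch, assuming the snippet lives in a div with class gs_rs (the selector used in the other answer); SnippetParser, extract_snippets and sample_html are illustrative names, not part of any library:

```python
from html.parser import HTMLParser


class SnippetParser(HTMLParser):
    """Collect the text inside every <div class="gs_rs"> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.snippets = []  # finished snippets, one string per result
        self._depth = 0     # > 0 while inside a gs_rs div (counts nested divs)
        self._parts = []    # text fragments of the snippet being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1      # nested div inside a snippet
        elif "gs_rs" in classes:
            self._depth = 1       # entering a snippet div
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:  # snippet div closed: normalise whitespace
                self.snippets.append(" ".join("".join(self._parts).split()))

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def extract_snippets(html):
    """Return the visible text of each gs_rs div, without any markup."""
    parser = SnippetParser()
    parser.feed(html)
    parser.close()
    return parser.snippets


# Hand-written sample markup, not a real Google Scholar response
sample_html = """
<div class="gs_ri">
  <h3 class="gs_rt"><a href="#">Some title</a></h3>
  <div class="gs_rs">Extensive research has <b>shown</b> that<br> COO effects matter</div>
  <div class="gs_fl"><a href="#">Cited by 10</a></div>
</div>
"""
print(extract_snippets(sample_html))
```

Inline tags such as b and br are dropped and only their text is kept, while the gs_fl footer is ignored entirely.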

Upvotes: 1
