Reputation: 27
I am trying to web-scrape for uni, but it's hard to do from Google Scholar. I've tried many things, and apparently the problem has to do with .json().
I want to make a function that takes brands such as Apple and Samsung as input and returns a list of headers with their respective abstracts.
Could someone please help me out here? Thank you! Below is what I have so far; I've commented out some other things I've tried.
from bs4 import BeautifulSoup
import requests
import csv
import json

brand = input("Enter Technology: ")

# Fetch the Google Scholar results page for "<brand> technology"
source = requests.get('https://scholar.google.com/scholar?0&q={0}+technology'.format(brand)).text
soup = BeautifulSoup(source, 'lxml')

# Other things I've tried:
#script = soup.select_one('[type="application/ld+json"]').text
#data = json.loads(script)
#soup = BeautifulSoup(data['description'], 'lxml')

headers = soup.find_all('div', class_="gs_rt")
print(headers)
Upvotes: 2
Views: 556
Reputation: 1724
The first thing you can do is add proxies to your request:
import os

# https://docs.python-requests.org/en/master/user/advanced/#proxies
proxies = {
    'http': os.getenv('HTTP_PROXY'),   # Or just type your proxy here without os.getenv()
    'https': os.getenv('HTTPS_PROXY')  # Google Scholar is served over HTTPS, so this key is the one that matters
}
The request will then look like this (headers would be your request headers, e.g. a User-Agent):
html = requests.get('google scholar link', headers=headers, proxies=proxies).text
A more naive method is to put random pauses between requests. Alternatively, you can use selenium, requests-html, or pyppeteer to render the page without using proxies, but Google may still block your requests if you send too many at the same time. A sketch of the random-pause approach follows.
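A minimal sketch of the random-pause idea, assuming the query list and the 5-15 second delay bounds are placeholders you would tune:
import time
import random
import requests

queries = ['apple', 'samsung']  # hypothetical brands to search for
for q in queries:
    html = requests.get('https://scholar.google.com/scholar', params={'q': q + ' technology'}).text
    # Sleep a random 5-15 seconds so the request timing doesn't look scripted
    time.sleep(random.uniform(5, 15))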
Note: if you get an empty array, it means you hit a CAPTCHA. Print the response text to see what is going on, or wait some time before sending requests again.
from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=samsung&btnG=')

# Render the JavaScript on the page
# https://requests-html.kennethreitz.org/#javascript-support
response.html.render()

# .gs_ri is the container where the data we need is located
for result in response.html.find('.gs_ri'):
    title = result.find('.gs_rt', first=True).text
    print(title)
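Since the question also asks for abstracts: Google Scholar only exposes a short snippet per result. Assuming it still lives in the .gs_rs element (worth verifying against the current markup), it can be pulled in the same loop:
for result in response.html.find('.gs_ri'):
    title = result.find('.gs_rt', first=True).text
    # .gs_rs is assumed to hold the snippet/abstract preview; it can be missing
    snippet = result.find('.gs_rs', first=True)
    print(title, '-', snippet.text if snippet else 'no snippet')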
Alternatively, you can scrape data from Google Scholar using the Google Scholar API from SerpApi. There's no need to figure out how to bypass Google's blocking or how to render a JavaScript page. It's a paid API with a free plan.
Code to integrate:
from serpapi import GoogleSearch

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google_scholar",
    "q": "samsung",
    "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    print(f"Title: {result['title']}")
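If you want the abstract previews too, each organic result should also carry a snippet field (an assumption based on SerpApi's typical result shape; check their docs for this engine):
for result in results['organic_results']:
    print(f"Title: {result['title']}")
    # 'snippet' is assumed to be the abstract preview; guard in case it's absent
    print(f"Snippet: {result.get('snippet', 'n/a')}")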
Disclaimer: I work for SerpApi.
Upvotes: 1
Reputation: 1826
Google Scholar is a JavaScript-enabled website, so using Selenium to scrape the site would be a good solution. For more details, refer here. A minimal sketch is below.
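A minimal Selenium sketch, assuming chromedriver is available on your PATH and that the result markup still uses the .gs_ri / .gs_rt classes:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get('https://scholar.google.com/scholar?q=samsung+technology')

# Each result lives in a .gs_ri container; the title sits in .gs_rt
for result in driver.find_elements(By.CSS_SELECTOR, '.gs_ri'):
    print(result.find_element(By.CSS_SELECTOR, '.gs_rt').text)

driver.quit()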
Upvotes: 0
Reputation: 17358
Google Scholar links out to different sites like ScienceDirect, ACM, etc. I have added abstract selectors only for ScienceDirect and ACM; you can add more if you want.
Google Scholar paginates with a start index: page 1 has start=0, page 2 has start=10, and so on. The following script asks for a brand and the number of pages to crawl, then saves two files: one JSON and one CSV.
from bs4 import BeautifulSoup
import requests
import time
import pandas as pd
import json

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}

brand = input("Enter Technology: ")
pages = int(input("Number of pages: "))

url = "https://scholar.google.com/scholar?start={}&q={}+technology&hl=en&as_sdt=0,5"

data = []
for i in range(0, pages * 10, 10):  # start = 0, 10, 20, ... one step per results page
    print(url.format(i, brand))
    res = requests.get(url.format(i, brand), headers=headers)
    main_soup = BeautifulSoup(res.text, "html.parser")

    # Each search result sits in a div with these classes
    divs = main_soup.find_all("div", class_="gs_r gs_or gs_scl")
    for div in divs:
        temp = {}
        h3 = div.find("h3", class_="gs_rt")
        temp["Link"] = h3.find("a")["href"]
        temp["Heading"] = h3.find("a").get_text(strip=True)
        temp["Authors"] = div.find("div", class_="gs_a").get_text(strip=True)
        print(temp["Link"])
        try:
            # Follow the result link and pull the abstract with a site-specific selector
            res_link = requests.get(temp["Link"], headers=headers)
            soup_link = BeautifulSoup(res_link.text, "html.parser")
            if "sciencedirect" in temp["Link"]:
                temp["Abstract"] = soup_link.find("div", class_="abstract author").find("div").get_text(strip=True)
            elif "acm" in temp["Link"]:
                temp["Abstract"] = soup_link.find("div", class_="abstractSection abstractInFull").get_text(strip=True)
        except Exception:
            pass
        data.append(temp)
        time.sleep(1)  # small pause between detail-page requests

with open("data.json", "w") as f:
    json.dump(data, f)

pd.DataFrame(data).to_csv("data.csv", index=False)
Output:
Link,Heading,Authors,Abstract
https://www.sciencedirect.com/science/article/pii/0149197096000078,Development of pyroprocessingtechnology,"JJ Laidler, JE Battles, WE Miller, JP Ackerman… - Progress in Nuclear …, 1997 - Elsevier","A compact, efficient method for recycling IFR fuel is being developed. This method, known as pyroprocessing, capitalizes on the use of metal fuel in the IFR and provides separation of actinide elements from fission products by means of an electrorefining step. The process of electrorefining is based on well-understood electrochemical concepts, the applications of which are described in this chapter. With only the addition of head-end processing steps, the pyroprocess can be applied with equal success to fuel types other than metal, enabling a symbiotic system wherein the IFR can be used to fission the actinide elements in spent nuclear fuel from other types of reactor."
https://www.sciencedirect.com/science/article/pii/S0041624X97001467,Acoustic wave sensors and theirtechnology,"MJ Vellekoop - Ultrasonics, 1998 - Elsevier","In the past two decades, acoustic-wave devices have gained enormous interest for sensor applications. The delay line device, where a transmitting and a receiving interdigital transducer are realized on a (piezoelectric) substrate is the most common structure used. The sensitive part is the surface between the two transducers. By placing the device in the feedback loop of an amplifier, an acoustic-wave oscillator is formed with properties such as inherent high sensitivity, high resolution, high stability and a frequency output signal which is easy to process.A very interesting development is the large amount of wave types now available for sensor applications. Sensors have been published using Rayleigh waves, Lamb waves, Love waves, acoustic plate modes, and surface transverse waves (STW). Each of these wave types have their special advantages and disadvantages with respect to sensitivity, stability, usability in liquids or gases, and fabrication complexity. For the fabrication of the acoustic-wave devices, planar technologies are used, which will be discussed in the paper. Examples will be given of gas sensors, biochemical sensors in liquids, viscosity and density sensing and high-voltage sensing. A comparison of the usability of the different wave types will be presented."
https://www.sciencedirect.com/science/article/pii/0167268188900558,Technologyand transaction cost economics: a reply,"OE Williamson - Journal of Economic Behavior & Organization, 1988 - Elsevier","I argue here, as I have previously, that technology is neither fully determinative of nor irrelevant to economic organization. Transaction cost economizing occupies a prominent position in any effort to assess the efficacy of alternative forms of economic organization."
https://www.sciencedirect.com/science/article/pii/0048733394900140,Learning by trying: the implementation of configurationaltechnology,"J Fleck- Research policy, 1994 - Elsevier","In this paper some issues concerning the nature of technological development are examined, with particular reference to a case study of the implementation of Computer Aided Production Management (CAPM). CAPM is an example of a configurational technology, built up to meet specific organizational requirements. It is argued that there is scope in the development of configurations for significant innovation to take place during implementation itself, through a distinctive form of learning by ‘struggling to get it to work’, or ‘learning by trying’. Some policy implications are outlined in conclusion: the need to recognize the creative opportunities available in this type of development, and the need to facilitate industrial sector-based learning processes."
...
Upvotes: 0