E. Aly

Reputation: 331

Beautifulsoup doesn't reach a child element

I wrote the following code trying to scrape a Google Scholar page:

import requests as req
from bs4 import BeautifulSoup as soup

url = r'https://scholar.google.com/scholar?hl=en&q=Sustainability and the measurement of wealth: further reflections'

session = req.Session()
content = session.get(url)
html2bs = soup(content.content, 'lxml')
gs_cit = html2bs.select('#gs_cit')
gs_citd = html2bs.find('div', {'id':"gs_citd"})
gs_cit1 =  html2bs.find('div', {'id':"gs_cit1"})

But gs_citd gives me only this line <div aria-live="assertive" id="gs_citd"></div> and doesn't reach any level beneath it. Also, gs_cit1 returns None.

As shown in this image:

I want to reach the highlighted class to be able to grab the BibTeX citation.

Can you help, please?

Upvotes: 1

Views: 1249

Answers (2)

Dmitriy Zub

Reputation: 1724

You can parse BibTeX data with beautifulsoup and requests by extracting the data-cid attribute, which is a unique publication ID. Temporarily store those IDs in a list, iterate over them, and make a request for each ID to parse its BibTeX citation.



The example below will work for roughly 10-20 requests, after which Google will throw a CAPTCHA or you'll hit the rate limit. The ideal solution is a CAPTCHA-solving service combined with proxies.
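
As a partial workaround, here is a minimal sketch of routing a request through a proxy and pausing between requests; the proxy URL is a placeholder, not a working endpoint:

import time
import requests

# placeholder credentials/host - substitute a proxy from your own provider
proxies = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}

response = requests.get(
    "https://scholar.google.com/scholar",
    params={"q": "samsung", "hl": "en"},
    proxies=proxies,
    timeout=10,
)
time.sleep(2)  # pause between requests to lower the chance of a CAPTCHA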

Code and full example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

params = {
    "q": "samsung",
    "hl": "en"
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582",
    "server": "scholar",
    "referer": f"https://scholar.google.com/scholar?hl={params['hl']}&q={params['q']}",
}


def cite_ids() -> list:
    response = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=10)
    soup = BeautifulSoup(response.text, "lxml")
    
    # returns a list of publication IDs, e.g. U8bh6Ca9uwQJ
    return [result["data-cid"] for result in soup.select(".gs_or")]


def scrape_cite_results() -> list:

    bibtex_data = []

    for cite_id in cite_ids():
        response = requests.get(f"https://scholar.google.com/scholar?output=cite&q=info:{cite_id}:scholar.google.com", headers=headers, timeout=10)
        soup = BeautifulSoup(response.text, "lxml")
       
        # selects the first matched element, which in this case will always be BibTeX
        # (as long as Google doesn't change the BibTeX position)
        bibtex_data.append(soup.select_one(".gs_citi")["href"])
    
    # returns a list of BibTex URLs, for example: https://scholar.googleusercontent.com/scholar.bib?q=info:ifd-RAVUVasJ:scholar.google.com/&output=citation&scisdr=CgVDYtsfELLGwov-iJo:AAGBfm0AAAAAYgD4kJr6XdMvDPuv7R8SGODak6AxcJxi&scisig=AAGBfm0AAAAAYgD4kHUUPiUnYgcIY1Vo56muYZpFkG5m&scisf=4&ct=citation&cd=-1&hl=en
    return bibtex_data
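
A minimal usage sketch, assuming the selectors above still match the page; it downloads each returned BibTeX URL to files named citation_0.bib, citation_1.bib, and so on:

if __name__ == "__main__":
    for index, bibtex_url in enumerate(scrape_cite_results()):
        # download the .bib file behind each BibTeX link
        bib_response = requests.get(bibtex_url, headers=headers, timeout=10)
        with open(f"citation_{index}.bib", "w") as bib_file:
            bib_file.write(bib_response.text)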

Alternatively, you can achieve the same thing using the Google Scholar API from SerpApi, without having to figure out which proxy provider offers good proxies, run a CAPTCHA-solving service, or work out how to scrape JavaScript-rendered data without browser automation.

Example to integrate:

import os
from serpapi import GoogleSearch

def organic_results() -> list:
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google_scholar",
        "q": "samsung",  # search query
        "hl": "en",      # language
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    return [result["result_id"] for result in results["organic_results"]]


def cite_results() -> list:

    citation_results = []

    for citation in organic_results():
        params = {
            "api_key": os.getenv("API_KEY"),
            "engine": "google_scholar_cite",
            "q": citation
        }

        search = GoogleSearch(params)
        results = search.get_dict()

        for result in results["links"]:
            if "BibTeX" in result["name"]:
                citation_results.append(result["link"])

    return citation_results
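
A short usage sketch, assuming the API_KEY environment variable holds a valid SerpApi key:

if __name__ == "__main__":
    for bibtex_link in cite_results():
        print(bibtex_link)  # each entry is a BibTeX download URL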

If you would like to parse the data from all available pages, there's a dedicated blog post at SerpApi, Scrape historic Google Scholar results using Python, which covers scraping historic 2017-2021 organic and cite results to CSV and SQLite using pagination.

Disclaimer, I work for SerpApi.

Upvotes: 0

kyle

Reputation: 683

Ok, so I figured it out. I used the selenium module for Python, which drives a real browser and lets you perform actions like clicking links and reading the resulting HTML. There was another issue I ran into while solving this: the page had to finish loading, otherwise the pop-up div just contained "Loading...", so I used the Python time module to time.sleep(2) for 2 seconds, which allowed the content to load in. Then I parsed the resulting HTML with BeautifulSoup to find the anchor tag with the class "gs_citi", pulled the href from that anchor, and requested it with the "requests" Python module. Finally, I wrote the decoded response to a local file, scholar.bib.

I installed chromedriver and selenium on my Mac using these instructions here: https://gist.github.com/guylaor/3eb9e7ff2ac91b7559625262b8a6dd5f

Then I signed my Python file to avoid firewall issues, using these instructions: Add Python to OS X Firewall Options?

The following is the code I used to produce the output file "scholar.bib":

import os
import time
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import requests as req

# Setup Selenium Chrome Web Driver
chromedriver = "/usr/local/bin/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)

# Navigate in Chrome to specified page.
driver.get("https://scholar.google.com/scholar?hl=en&q=Sustainability and the measurement of wealth: further reflections")

# Find "Cite" link by looking for anchors that contain "Cite" - second link selected "[1]"
link = driver.find_elements_by_xpath('//a[contains(text(), "' + "Cite" + '")]')[1]
# Click the link
link.click()

print("Waiting for page to load...")
time.sleep(2) # Sleep for 2 seconds

# Get Page source after waiting for 2 seconds of current page in Chrome
source = driver.page_source

# We are done with the driver so quit.
driver.quit()

# Use BeautifulSoup to parse the html source and use "html.parser" as the Parser
soupify = soup(source, 'html.parser')

# Find anchors with the class "gs_citi"
gs_citt = soupify.find('a',{"class":"gs_citi"})

# Get the href attribute of the first anchor found
href = gs_citt['href']

print("Fetching: ", href)

# Instantiate a new requests session
session = req.Session()

# Get the response object of href
content = session.get(href)

# Get the content and then decode() it.
bibtex_html = content.content.decode()

# Write the decoded data to a file named scholar.bib
with open("scholar.bib","w") as file:
    file.writelines(bibtex_html)

Hope this helps anyone looking for a solution to this.

Scholar.bib file:

@article{arrow2013sustainability,
  title={Sustainability and the measurement of wealth: further reflections},
  author={Arrow, Kenneth J and Dasgupta, Partha and Goulder, Lawrence H and Mumford, Kevin J and Oleson, Kirsten},
  journal={Environment and Development Economics},
  volume={18},
  number={4},
  pages={504--516},
  year={2013},
  publisher={Cambridge University Press}
}

Upvotes: 3
