Reputation: 15
I am trying to scrape some data from Google Scholar with scrapy; my code is the following:
import scrapy

class TryscraperSpider(scrapy.Spider):
    name = 'tryscraper'
    start_urls = ['https://scholar.google.com/citations?hl=en&user=JUn8PgwAAAAJ&pagesize=100&view_op=list_works&sortby=pubdate']

    def parse(self, response):
        # Follow every paper linked from the profile page
        for link in response.css('a.gsc_a_at::attr(href)'):
            yield response.follow(link.get(), callback=self.parse_scholar)

    def parse_scholar(self, response):
        try:
            yield {
                'authors': response.css('div.gsc_oci_value::text').get().strip(),
                'journal': response.css('div.gsc_oci_value::text').extract()[2].strip(),
                'date': response.css('div.gsc_oci_value::text').extract()[1].strip(),
                'abstract': response.css('div.gsh_csp::text').get()
            }
        except Exception:
            # Fall back to 'NA' when one of the fields is missing
            yield {
                'authors': response.css('div.gsc_oci_value::text').get().strip(),
                'journal': response.css('div.gsc_oci_value::text').extract()[2].strip(),
                'date': response.css('div.gsc_oci_value::text').extract()[1].strip(),
                'abstract': 'NA'
            }
This code works well, but it only gives me the first 100 papers from the author. I would like to scrape them all, but that would require the spider to also press the "Show more" button. I have seen in related posts that scrapy has no built-in way to do this, but that you can maybe incorporate functionality from selenium to do the job. Unfortunately, I am a bit of a novice and therefore completely lost; any suggestions? Thanks in advance.
Here is the selenium code that should do the job, but I would like to combine it with my scrapy spider, which works well and is very fast.
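(A rough sketch of the idea; the button id gsc_bpf_more and the fixed sleep are my assumptions from looking at the page, not tested code.)

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://scholar.google.com/citations?hl=en&user=JUn8PgwAAAAJ'
           '&pagesize=100&view_op=list_works&sortby=pubdate')

while True:
    # The "Show more" button (id assumed to be gsc_bpf_more)
    more = driver.find_element(By.ID, 'gsc_bpf_more')
    if not more.is_enabled():
        break  # the button is disabled once every paper is listed
    more.click()
    time.sleep(2)  # crude wait for the next batch of rows to render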
Upvotes: 0
Views: 271
Reputation: 22440
Check out the following implementation. Instead of driving a browser, it mimics the POST request the "Show more" button fires (the json=1 form field) and keeps incrementing the cstart offset by 100 until the response comes back empty, so it should give you all the results from that page.
import scrapy
import urllib.parse
from scrapy import Selector

class ScholarSpider(scrapy.Spider):
    name = 'scholar'
    start_url = 'https://scholar.google.com/citations?'
    params = {
        'hl': 'en',
        'user': 'JUn8PgwAAAAJ',
        'view_op': 'list_works',
        'sortby': 'pubdate',
        'cstart': 0,
        'pagesize': '100'
    }

    def start_requests(self):
        req_url = f"{self.start_url}{urllib.parse.urlencode(self.params)}"
        # POSTing json=1 makes Scholar answer with JSON, the same call
        # the "Show more" button makes
        yield scrapy.FormRequest(req_url, formdata={'json': '1'}, callback=self.parse)

    def parse(self, response):
        # The 'B' key holds the HTML of the result rows; it comes back
        # empty once the profile is exhausted
        if not response.json()['B']:
            return
        resp = Selector(text=response.json()['B'])
        for item in resp.css("tr > td > a[href^='/citations']::attr(href)").getall():
            inner_link = f"https://scholar.google.com{item}"
            yield scrapy.Request(inner_link, callback=self.parse_content)
        # Move the offset forward and request the next batch of 100
        self.params['cstart'] += 100
        req_url = f"{self.start_url}{urllib.parse.urlencode(self.params)}"
        yield scrapy.FormRequest(req_url, formdata={'json': '1'}, callback=self.parse)

    def parse_content(self, response):
        yield {
            'authors': response.css(".gsc_oci_field:contains('Author') + .gsc_oci_value::text").get(),
            'journal': response.css(".gsc_oci_field:contains('Journal') + .gsc_oci_value::text").get(),
            'date': response.css(".gsc_oci_field:contains('Publication date') + .gsc_oci_value::text").get(),
            'abstract': response.css("#gsc_oci_descr .gsh_csp::text").get()
        }
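If you save the spider as, say, scholar.py (the filename is just an example), you can run it without a full scrapy project and export everything in one go:

scrapy runspider scholar.py -o papers.json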
Upvotes: 0