tommaso crosta

Reputation: 15

Clicking Google Scholar Button with Scrapy

I am trying to scrape some data from Google Scholar with Scrapy; my code is the following:

import scrapy


class TryscraperSpider(scrapy.Spider):
    name = 'tryscraper'
    start_urls = ['https://scholar.google.com/citations?hl=en&user=JUn8PgwAAAAJ&pagesize=100&view_op=list_works&sortby=pubdate']

    def parse(self, response):
        # Follow the link of every paper listed on the profile page
        for link in response.css('a.gsc_a_at::attr(href)'):
            yield response.follow(link.get(), callback=self.parse_scholar)

    def parse_scholar(self, response):
        # The field values appear in a fixed order on the detail page:
        # index 0 = authors, 1 = publication date, 2 = journal
        values = response.css('div.gsc_oci_value::text').getall()
        yield {
            'authors': values[0].strip() if len(values) > 0 else 'NA',
            'journal': values[2].strip() if len(values) > 2 else 'NA',
            'date': values[1].strip() if len(values) > 1 else 'NA',
            'abstract': response.css('div.gsh_csp::text').get() or 'NA',
        }


This code works well, but it only gives me the author's first 100 papers. I would like to scrape them all, but that would require the spider to also press the "Show more" button. I have seen in related posts that Scrapy has no built-in way to do this, but that you can maybe incorporate functionality from Selenium to do the job. Unfortunately, I am a bit of a novice and therefore completely lost. Any suggestions? Thanks in advance.

Here is the Selenium code that should do the job, but I would like to combine it with my Scrapy spider, which works well and is very fast.
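Roughly something like this, assuming the "Show more" button keeps its current id gsc_bpf_more (a minimal sketch, not tested against the live page):

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

url = ('https://scholar.google.com/citations?hl=en&user=JUn8PgwAAAAJ'
       '&pagesize=100&view_op=list_works&sortby=pubdate')

driver = webdriver.Chrome()
driver.get(url)

# Keep clicking "Show more" until the button is disabled, i.e. no more rows
while True:
    button = driver.find_element(By.ID, 'gsc_bpf_more')  # id is an assumption
    if not button.is_enabled():
        break
    button.click()
    time.sleep(2)  # give the new rows time to load

# Collect the per-paper links that the Scrapy spider would follow
links = [a.get_attribute('href')
         for a in driver.find_elements(By.CSS_SELECTOR, 'a.gsc_a_at')]
driver.quit()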

Upvotes: 0

Views: 271

Answers (1)

SIM

Reputation: 22440

Check out the following implementation. It should give you all the results from that page, exhausting the "Show more" button: instead of clicking the button, the spider sends the same paginated POST request (json=1, stepping cstart by 100) that the button fires, so no browser is needed.

import scrapy
import urllib.parse
from scrapy import Selector


class ScholarSpider(scrapy.Spider):
    name = 'scholar'
    start_url = 'https://scholar.google.com/citations?'

    params = {
        'hl': 'en',
        'user': 'JUn8PgwAAAAJ',
        'view_op': 'list_works',
        'sortby': 'pubdate',
        'cstart': 0,
        'pagesize': 100
    }

    def start_requests(self):
        req_url = f"{self.start_url}{urllib.parse.urlencode(self.params)}"
        # POSTing json=1 makes Scholar return the results table as JSON,
        # the same request the "Show more" button fires
        yield scrapy.FormRequest(req_url, formdata={'json': '1'}, callback=self.parse)

    def parse(self, response):
        # The "B" key holds the HTML of the results table; it is empty
        # once the pagination is exhausted
        if not response.json()['B']:
            return

        resp = Selector(text=response.json()['B'])
        for item in resp.css("tr > td > a[href^='/citations']::attr(href)").getall():
            inner_link = f"https://scholar.google.com{item}"
            yield scrapy.Request(inner_link, callback=self.parse_content)

        # Advance the offset by one page and request the next batch
        self.params['cstart'] += 100
        req_url = f"{self.start_url}{urllib.parse.urlencode(self.params)}"
        yield scrapy.FormRequest(req_url, formdata={'json': '1'}, callback=self.parse)

    def parse_content(self, response):
        # Pick each value by its field label so the order of the rows
        # on the detail page does not matter
        yield {
            'authors': response.css(".gsc_oci_field:contains('Author') + .gsc_oci_value::text").get(),
            'journal': response.css(".gsc_oci_field:contains('Journal') + .gsc_oci_value::text").get(),
            'date': response.css(".gsc_oci_field:contains('Publication date') + .gsc_oci_value::text").get(),
            'abstract': response.css("#gsc_oci_descr .gsh_csp::text").get()
        }
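
To run it and save the items, something like this should work (assuming the spider is saved as scholar.py; Scrapy's -o flag appends items to the output file):

scrapy runspider scholar.py -o papers.json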

Upvotes: 0
