Reputation: 15
I am trying to scrape some data from Google Scholar with scrapy; my code is the following:
import scrapy

class TryscraperSpider(scrapy.Spider):
    name = 'tryscraper'
    start_urls = ['https://scholar.google.com/citations?hl=en&user=JUn8PgwAAAAJ&pagesize=100&view_op=list_works&sortby=pubdate']

    def parse(self, response):
        # Follow every paper linked from the profile page
        for link in response.css('a.gsc_a_at::attr(href)'):
            yield response.follow(link.get(), callback=self.parse_scholar)

    def parse_scholar(self, response):
        try:
            yield {
                'authors': response.css('div.gsc_oci_value::text').get().strip(),
                'journal': response.css('div.gsc_oci_value::text').extract()[2].strip(),
                'date': response.css('div.gsc_oci_value::text').extract()[1].strip(),
                'abstract': response.css('div.gsh_csp::text').get()
            }
        except Exception:
            # Fall back to 'NA' when one of the fields is missing
            yield {
                'authors': response.css('div.gsc_oci_value::text').get().strip(),
                'journal': response.css('div.gsc_oci_value::text').extract()[2].strip(),
                'date': response.css('div.gsc_oci_value::text').extract()[1].strip(),
                'abstract': 'NA'
            }
This code works well, but it only gives me the first 100 papers from the author. I would like to scrape them all, but that would require the spider to also press the "Show more" button. I have seen in related posts that scrapy has no built-in way to do this, but that you can maybe incorporate functionality from selenium to do the job. Unfortunately, I am a bit of a novice and therefore completely lost; any suggestions? Thanks in advance.
Here is the selenium code that should do the job, but I would like to combine it with my scrapy spider, which works well and is very fast.
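(A rough sketch of the idea; the button id gsc_bpf_more and the fixed sleep are my assumptions from looking at the page, not tested code.)

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://scholar.google.com/citations?hl=en&user=JUn8PgwAAAAJ'
           '&pagesize=100&view_op=list_works&sortby=pubdate')

while True:
    # The "Show more" button (id assumed to be gsc_bpf_more)
    more = driver.find_element(By.ID, 'gsc_bpf_more')
    if not more.is_enabled():
        break  # the button is disabled once every paper is listed
    more.click()
    time.sleep(2)  # crude wait for the next batch of rows to render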
Upvotes: 0
Views: 271
Reputation: 22440
Check out the following implementation. Instead of driving a browser, it mimics the POST request the "Show more" button fires (the json=1 form field) and keeps incrementing the cstart offset by 100 until the response comes back empty, so it should give you all the results from that page.
import scrapy
import urllib.parse
from scrapy import Selector

class ScholarSpider(scrapy.Spider):
    name = 'scholar'
    start_url = 'https://scholar.google.com/citations?'
    params = {
        'hl': 'en',
        'user': 'JUn8PgwAAAAJ',
        'view_op': 'list_works',
        'sortby': 'pubdate',
        'cstart': 0,
        'pagesize': '100'
    }

    def start_requests(self):
        req_url = f"{self.start_url}{urllib.parse.urlencode(self.params)}"
        # POSTing json=1 makes Scholar answer with JSON, the same call
        # the "Show more" button makes
        yield scrapy.FormRequest(req_url, formdata={'json': '1'}, callback=self.parse)

    def parse(self, response):
        # The 'B' key holds the HTML of the result rows; it comes back
        # empty once the profile is exhausted
        if not response.json()['B']:
            return
        resp = Selector(text=response.json()['B'])
        for item in resp.css("tr > td > a[href^='/citations']::attr(href)").getall():
            inner_link = f"https://scholar.google.com{item}"
            yield scrapy.Request(inner_link, callback=self.parse_content)
        # Move the offset forward and request the next batch of 100
        self.params['cstart'] += 100
        req_url = f"{self.start_url}{urllib.parse.urlencode(self.params)}"
        yield scrapy.FormRequest(req_url, formdata={'json': '1'}, callback=self.parse)

    def parse_content(self, response):
        yield {
            'authors': response.css(".gsc_oci_field:contains('Author') + .gsc_oci_value::text").get(),
            'journal': response.css(".gsc_oci_field:contains('Journal') + .gsc_oci_value::text").get(),
            'date': response.css(".gsc_oci_field:contains('Publication date') + .gsc_oci_value::text").get(),
            'abstract': response.css("#gsc_oci_descr .gsh_csp::text").get()
        }
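If you save the spider as, say, scholar.py (the filename is just an example), you can run it without a full scrapy project and export everything in one go:

scrapy runspider scholar.py -o papers.json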
Upvotes: 0