Peter

Reputation: 301

Crawling Google Scholar

I am trying to get information on a large number of scholarly articles as part of my research study. The number of articles is on the order of thousands. Since Google Scholar does not have an API, I am trying to scrape/crawl Scholar. Now I know that this is technically against the EULA, but I am trying to be very polite and reasonable about this. I understand that Google doesn't allow bots, in order to keep traffic within reasonable limits. I started with a test batch of ~500 requests with 1 s between each request. I got blocked after about the first 100 requests. I have tried multiple other strategies, including:

  1. Extending the pauses to ~20 s and adding some random noise to them
  2. Making the pauses log-normally distributed (so that most pauses are on the order of seconds, but every now and then there are longer pauses of several minutes or more)
  3. Inserting long pauses (several hours) between blocks of ~100 requests (a sketch of this pacing is shown after the list).
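
For concreteness, the pacing logic I'm describing could look roughly like this (a minimal Python sketch; the URL, the User-Agent string, and the parsing step are placeholders, not my actual code):

    import random
    import time
    import requests

    queries = ["placeholder query %d" % i for i in range(500)]

    for i, q in enumerate(queries, start=1):
        resp = requests.get(
            "https://scholar.google.com/scholar",
            params={"q": q},
            headers={"User-Agent": "Mozilla/5.0"},  # placeholder UA
            timeout=30,
        )
        # ... parse resp.text here ...

        # Strategy 2: log-normal pause -- median exp(1.5) ~ 4.5 s,
        # with an occasional pause of several minutes.
        time.sleep(random.lognormvariate(1.5, 1.0))

        # Strategy 3: a long pause (a few hours) after every block of ~100.
        if i % 100 == 0:
            time.sleep(random.uniform(2, 4) * 3600)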

I doubt that at this point my script adds any considerable traffic over what a human would. But one way or another, I always get blocked after ~100-200 requests. Does anyone know of a good strategy to overcome this? (I don't care if it takes weeks, as long as it is automated.)

Also, does anyone have experience dealing with Google directly and asking for permission to do something similar (for research etc.)? Is it worth writing to them to explain what I'm trying to do and how, and seeing whether I can get permission for my project? And how would I go about contacting them? Thanks!

Upvotes: 10

Views: 5659

Answers (1)

Morten Bergfall

Reputation: 2316

Without testing, I'm still pretty sure one of the following will do the trick:

  1. Easy, but with a small chance of success:

    Delete all cookies from the site in question after every rand(0,100) requests,
    then change your user-agent, accepted language, etc., and repeat
    (see the first sketch after this list).

  2. A bit more work, but a much sturdier spider as a result:

    Send your requests through Tor, other proxies, mobile networks, etc. to mask your IP
    (and apply suggestion 1 at every turn); see the second sketch after this list.
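
Here is a minimal sketch of suggestion 1, assuming Python and the requests library (the User-Agent pool, the language values, the URL list, and the reset threshold are all illustrative, not tested against Scholar):

    import random
    import requests

    USER_AGENTS = [  # small illustrative pool; use a larger, realistic one
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Gecko/20100101 Firefox/115.0",
    ]
    LANGUAGES = ["en-US,en;q=0.9", "en-GB,en;q=0.8", "de-DE,de;q=0.7"]

    def fresh_session():
        # A new Session starts with an empty cookie jar; draw a new identity too.
        s = requests.Session()
        s.headers.update({
            "User-Agent": random.choice(USER_AGENTS),
            "Accept-Language": random.choice(LANGUAGES),
        })
        return s

    urls = ["https://scholar.google.com/scholar?q=placeholder+%d" % i
            for i in range(300)]  # placeholder targets

    session = fresh_session()
    remaining = random.randint(1, 100)  # rand(0,100) requests, as above

    for url in urls:
        resp = session.get(url, timeout=30)
        remaining -= 1
        if remaining == 0:  # drop all cookies, change identity, repeat
            session.close()
            session = fresh_session()
            remaining = random.randint(1, 100)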
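
And a sketch of suggestion 2 via Tor, assuming a local Tor client listening on its default SOCKS port 9050 and PySocks installed (pip install requests[socks]); the proxy addresses are the stock Tor defaults, nothing Scholar-specific:

    import requests

    TOR_PROXIES = {
        # socks5h means DNS lookups also go through Tor
        "http": "socks5h://127.0.0.1:9050",
        "https": "socks5h://127.0.0.1:9050",
    }

    session = requests.Session()
    session.proxies.update(TOR_PROXIES)
    resp = session.get("https://scholar.google.com/scholar?q=test", timeout=60)
    print(resp.status_code)

    # Restarting Tor (or sending NEWNYM over its control port) yields a new
    # exit IP; combine that with the cookie/header reset from the first sketch.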

Update regarding Selenium: I missed the fact that you're using Selenium; I took it for granted that you were working in some kind of modern programming language only (I know that Selenium can be driven from most widely used languages, but it also exists as a sort of browser plug-in, demanding very little programming skill).

As I then presume your coding skills aren't (or weren't?) mind-boggling, my answer, for you and for others with the same limitations when using Selenium, is to learn either a simple scripting language (PowerShell?!) or JavaScript (since it's the web you're on ;-)) and take it from there.

If automating scraping smoothly were as simple as a browser plug-in, the web would have to be a much messier, more obfuscated, and more credential-demanding place.

Upvotes: 3
