Nelly Kong

Reputation: 289

Scraping a large number of Google Scholar pages by URL

I'm trying to get the full author list for every publication by an author on Google Scholar using BeautifulSoup. Since the author's profile page shows only a truncated author list for each paper, I have to open each paper's link to get the full list. As a result, I run into a CAPTCHA every few requests.

Is there a way to avoid the CAPTCHA (e.g. pausing for 3 seconds after every request)? Or to make the original Google Scholar profile page show the full author list?

Upvotes: 3

Views: 2598

Answers (1)

Dmitriy Fialkovskiy

Reputation: 3225

I recently faced a similar issue. I at least eased my collection process with a simple workaround: a random and fairly long sleep between requests, like this:

import time
import numpy as np

# Pause for a random interval between 5 and 30 seconds before the next request
time.sleep(np.random.uniform(5, 30))

If you have enough time (say, you launch your parser overnight), you can make the pause even longer (3 or more times longer) to further reduce the chance of getting a CAPTCHA.

Furthermore, you can randomly change the user agent in your requests to the site, which will mask you even more.
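
A minimal sketch of combining both ideas, using only the standard library. The user-agent strings, the function names build_request and fetch, and the default delay bounds are illustrative assumptions rather than anything Google Scholar documents, and this is not guaranteed to prevent CAPTCHAs:

```python
import random
import time
import urllib.request

# Hypothetical pool of user-agent strings; replace with ones copied from real browsers.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def build_request(url):
    """Return a Request for `url` carrying a randomly chosen User-Agent header."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)


def fetch(url, min_delay=5.0, max_delay=30.0):
    """Sleep for a random interval, then download `url` and return the raw bytes."""
    time.sleep(random.uniform(min_delay, max_delay))
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()
```

The bytes returned by fetch(paper_url) can then be passed to BeautifulSoup as usual, and raising min_delay and max_delay gives the longer overnight pauses mentioned above.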

Upvotes: 8
