Reputation: 107
I am trying to do some web scraping for a study project. Unfortunately, I need to scrape some data from Google Scholar, which blocks my requests. I have tried using multiple HTTP proxies, but my requests still get blocked after ~300 tries.
The resulting HTML from the blocked requests contains:
IP address: 145.109...<br/>Time: 2016-05-05T09:23:37Z<br/>URL: https://scholar.google.nl/citations?hl=en&view_op=search_authors&mauthors=Perry<br/>
The above IP is my own, while my proxies dict (it selects a proxy from a list at random) and get request look like this:
proxies = {'http': 'http://<username>:<password>@107.182....:<port>'}
result = requests.get('https://scholar.google.nl/citations?hl=en&view_op=search_authors&mauthors=Perry',
                      proxies=proxies, headers=headers)
The proxy IPs are of course valid and working, and my own IP is not included in the proxy list. Am I doing something wrong?
Edit: For completeness, I have also tried setting authentication like this answer suggests, but the result is the same.
Upvotes: 0
Views: 3216
Reputation: 69082
In your proxies dict, the URL scheme doesn't match the one you're using for your request: you have an http entry for your proxies but then make an https request, so the request bypasses the proxy and goes out from your own IP. If you add a proxy for the https scheme, it should work.
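As a rough sketch of the fix, the snippet below builds a proxies dict that covers both schemes. The helper names (build_proxies, proxy_for, fetch) and the proxy address proxy.example:8080 are made up for illustration, and proxy_for is only a simplified model of the lookup (requests also understands more specific keys such as scheme://host), so treat it as an approximation rather than the library's exact logic.

```python
from urllib.parse import urlsplit


def build_proxies(proxy_url):
    """Return a requests-style proxies dict that covers both http and https."""
    return {"http": proxy_url, "https": proxy_url}


def proxy_for(url, proxies):
    """Simplified model: pick the proxy whose key matches the URL's scheme.

    Returns None when no entry matches, which corresponds to a direct
    (unproxied) connection.
    """
    return proxies.get(urlsplit(url).scheme)


def fetch(url, proxy_url, headers=None):
    """Send a GET request through the given proxy for either scheme."""
    # Third-party dependency, imported here so the helpers above can be
    # used without requests installed.
    import requests

    return requests.get(url, proxies=build_proxies(proxy_url),
                        headers=headers, timeout=10)


# Hypothetical proxy URL, not taken from the question.
PROXY = "http://user:password@proxy.example:8080"
URL = "https://scholar.google.nl/citations?hl=en&view_op=search_authors&mauthors=Perry"

# With only an 'http' entry, the https URL finds no matching proxy.
only_http = proxy_for(URL, {"http": PROXY})
# With both entries, the https URL is routed through the proxy.
both = proxy_for(URL, build_proxies(PROXY))
```

Here only_http is None while both is the proxy URL, which mirrors what happened in the question: the https request had no matching entry, so Google saw the original IP.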
Upvotes: 2