py_coder1371

Reputation: 1

Python urllib.urlopen IOError using VPN

I am trying to access a website called scopus.com. I want to search for an author on it and get their number of publications, h-index, etc. The website cannot be accessed unless you are on a university Wi-Fi network (I use a VPN whenever I want to access it from home).

Here is the code:

import urllib

first_name = "John"
last_name = "Smith"

new_url = "http://www.scopus.com/results/authorNamesList.url?sort=\
count-f&src=al&sid=66892931B99391BF99AFADC3006D1357.WXhD7YyTQ6A7Pvk9AlA%3a50\
&sot=al&sdt=al&sl=47&s=AUTH--LAST--NAME%28" + last_name + \
"%29+AND+AUTH--FIRST%28" + first_name + "%29&st1=" + last_name + "&st2=" + first_name +\
"&orcidId=&selectionPageSearch=anl&reselectAuthor=false&activeFlag=false&showDocument=\
false&resultsPerPage=20&offset=1&jtp=false&currentPage=1&previousSelectionCount=\
0&tooManySelections=false&previousResultCount=0&authSubject=LFSC&authSubject=\
HLSC&authSubject=PHSC&authSubject=SOSC&exactAuthorSearch=false&showFullList=\
false&authorPreferredName=&origin=searchauthorlookup&affiliationId=&txGid=\
66892931B99391BF99AFADC3006D1357.WXhD7YyTQ6A7Pvk9AlA%3a5"

page_source = urllib.urlopen(new_url).read()

print page_source

No matter what I do I always get this error:

File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 386, in http_error_default
raise IOError, ('http error', errcode, errmsg, headers)

IOError: ('http error', 401, 'Unauthorized', <httplib.HTTPMessage instance at 0x102c85a28>)

I have spent some time on this forum and I think I have tried everything I could find (including pretending to access the website as Opera). Is there any way I can do this, or should I just give up and do this 700 times manually?

Thank you everyone for your help in advance

Upvotes: 0

Views: 1581

Answers (2)

Esparta Palma

Reputation: 745

This is not related to your VPN. The main problem is that you are trying to get a page that requires a valid session (which is established during the browser's request-response cycle). Your options:

But in any case, I encourage you to use the API for this kind of problem: Elsevier API.
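
To illustrate what "a valid session" means in practice, here is a minimal sketch of a cookie-aware opener, so that cookies set by one response are sent back on later requests the way a browser does. It is written for Python 3 (the question uses Python 2), and the helper names `make_session_opener` and `fetch` are made up for this example. Keeping cookies does not authenticate you by itself: a page that needs a login will still answer 401 until valid credentials are supplied, which is why the official API is the better route.

```python
import urllib.error
import urllib.request
from http.cookiejar import CookieJar


def make_session_opener():
    """Build an opener that stores cookies and resends them on later requests.

    Hypothetical helper for illustration; it only keeps cookies and
    does not log in anywhere.
    """
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch(opener, url):
    """Return (status_code, body) instead of raising on HTTP errors such as 401."""
    try:
        with opener.open(url, timeout=30) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # A 401 ends up here: the server wants credentials we did not send.
        return err.code, err.read()


if __name__ == "__main__":
    opener, jar = make_session_opener()
    # Placeholder URL; replace with a page you are permitted to fetch.
    status, body = fetch(opener, "http://example.com/")
    print(status, len(body), "cookies stored:", len(jar))
```

Reusing the same `opener` for every request is the important part: a fresh `urlopen` call starts with no cookies each time, so the server sees every request as a new, anonymous visitor.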

Upvotes: 1

RVT

Reputation: 220

Put simply, a 401 error means that you are unauthorized (and generally must log in to access the site). That said, what you are doing is expressly prohibited by their robots.txt file, so I'd advise you not to persist.

If you remain interested in crawling other websites, I'd suggest taking a look at the Python Requests module, as well as Beautiful Soup.
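
On the robots.txt point, a script can check the rules itself before requesting anything. The sketch below uses the standard library's `urllib.robotparser` in Python 3 (the question uses Python 2). The function name `is_allowed` and the `SAMPLE_ROBOTS` rules are invented for the example and are not the real robots.txt of any site.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /results/
"""

if __name__ == "__main__":
    for url in ("http://example.com/results/list.url", "http://example.com/about"):
        print(url, "allowed:", is_allowed(SAMPLE_ROBOTS, "my-script", url))
```

For a live site you would point the parser at the real file with `set_url()` and `read()` instead of passing the text in, and skip any URL for which `can_fetch` returns False.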

Upvotes: 0
