Reputation: 1
I am trying to access a website called scopus.com. I want to search for an author on it and retrieve their number of publications, h-index, etc. The website cannot be accessed unless you are on a university wifi network (I use a VPN whenever I want to access it from home).
Here is the code:
import urllib
first_name = "John"
last_name = "Smith"
new_url = "http://www.scopus.com/results/authorNamesList.url?sort=\
count-f&src=al&sid=66892931B99391BF99AFADC3006D1357.WXhD7YyTQ6A7Pvk9AlA%3a50\
&sot=al&sdt=al&sl=47&s=AUTH--LAST--NAME%28" + last_name + \
"%29+AND+AUTH--FIRST%28" + first_name + "%29&st1=" + last_name + "&st2=" + first_name +\
"&orcidId=&selectionPageSearch=anl&reselectAuthor=false&activeFlag=false&showDocument=\
false&resultsPerPage=20&offset=1&jtp=false&currentPage=1&previousSelectionCount=\
0&tooManySelections=false&previousResultCount=0&authSubject=LFSC&authSubject=\
HLSC&authSubject=PHSC&authSubject=SOSC&exactAuthorSearch=false&showFullList=\
false&authorPreferredName=&origin=searchauthorlookup&affiliationId=&txGid=\
66892931B99391BF99AFADC3006D1357.WXhD7YyTQ6A7Pvk9AlA%3a5"
page_source = urllib.urlopen(new_url).read()
print page_source
No matter what I do I always get this error:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 386, in http_error_default
raise IOError, ('http error', errcode, errmsg, headers)
IOError: ('http error', 401, 'Unauthorized', <httplib.HTTPMessage instance at 0x102c85a28>)
I have spent some time on this forum and I think I have tried everything I could find (including pretending to access the website as Opera). Is there any way I can do this, or should I just give up and do it 700 times manually?
Thank you everyone for your help in advance
Upvotes: 0
Views: 1581
Reputation: 745
This is not related to your VPN. The main problem is that you are requesting a page that requires a valid session (one that is established during the browser's request-response cycle). Your options:
But in any case, I encourage you to use the API for this kind of problem: Elsevier API.
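As a rough illustration of the API route, here is a minimal Python 3 sketch that builds an author-search request. The endpoint URL, the AUTHLAST/AUTHFIRST query syntax, and the X-ELS-APIKey header are my assumptions about Elsevier's developer API, so check them against the official documentation. The function names are made up for this example, and you would need your own API key.

```python
# Python 3 sketch. The endpoint, query syntax, and header name are assumptions
# about Elsevier's developer API; verify them in the official documentation.
import json
import urllib.parse
import urllib.request

API_URL = "https://api.elsevier.com/content/search/author"  # assumed endpoint


def build_author_search_request(first_name, last_name, api_key):
    """Build (but do not send) an author-search request (hypothetical helper)."""
    query = "AUTHLAST({}) AND AUTHFIRST({})".format(last_name, first_name)
    # urlencode percent-encodes the parentheses and turns spaces into '+'.
    url = API_URL + "?" + urllib.parse.urlencode({"query": query})
    headers = {"X-ELS-APIKey": api_key, "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers)


def search_author(first_name, last_name, api_key):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_author_search_request(first_name, last_name, api_key)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling search_author("John", "Smith", "YOUR_API_KEY") would send the request. Which fields of the response hold the publication count and h-index is something to confirm in the API documentation rather than take from this sketch.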
Upvotes: 1
Reputation: 220
Put simply, a 401 error means that you are unauthorized (generally, you must be logged in to access the site). Also, what you are doing is expressly prohibited by their robots.txt file, so I'd advise you not to persist.
That said, if you remain interested in crawling other websites, take a look at the Python Requests module, as well as Beautiful Soup.
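Before crawling any site, you can check its robots.txt programmatically. Here is a minimal Python 3 sketch using the standard library's urllib.robotparser. The rules and URLs below are hypothetical examples for illustration, not the actual Scopus robots.txt, and is_allowed is just a helper name I made up.

```python
import urllib.robotparser

# Hypothetical rules for illustration only; NOT the actual Scopus robots.txt.
SAMPLE_RULES = """\
User-agent: *
Disallow: /results/
""".splitlines()


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines permit user_agent to fetch url."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)  # parse rules that are already in memory
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_RULES, "my-crawler", "http://example.com/results/list.url"))
print(is_allowed(SAMPLE_RULES, "my-crawler", "http://example.com/about"))
```

For a live site you would instead call set_url() with the address of the site's robots.txt, followed by read(), and then can_fetch() as above.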
Upvotes: 0