Reputation: 3330
I'm trying to use a Python script to scrape a handful of pages on a government website. The script requests a URL that loads a normal page in my web browser, but for some reason the script gets an "Access Denied" page instead of the expected one.
Additionally, this "Access Denied" page is unlike any error I have ever seen on the government website; I can't reproduce it by any means other than my Python script.
Here is a stripped-down version of my script (it's rather big, so I cut out the bits I don't think are relevant):
    import requests

    headers = {
        'Accept': "*/*",
        'User-Agent': "nyc_contractors.py",
        'X-Love': "hey sysadmin! you're awesome! <3"
    }

    print "and we're off!"

    qLicensetype = "C"
    qBizname = "a"
    baseUrl = "http://a810-bisweb.nyc.gov/bisweb/ResultsByNameServlet?bizname="+qBizname+"&licensetype="+qLicensetype
    nextUrl = baseUrl

    while nextUrl is not None:
        print
        print "URL:", nextUrl
        r = requests.get(nextUrl, headers=headers)
        nextUrl = None  # kill the url (if there's a next page, we'll restore the url later)
        print "actual url:", r.url
        lines = r.text.splitlines()
        for line in lines:
            print "L:", line
And here is the log output from running that script:
and we're off!
URL: http://a810-bisweb.nyc.gov/bisweb/ResultsByNameServlet?bizname=a&licensetype=C
actual url: http://a810-bisweb.nyc.gov/bisweb/ResultsByNameServlet?bizname=a&licensetype=C
L: <HTML><HEAD>
L: <TITLE>Access Denied</TITLE>
L: </HEAD><BODY>
L: <H1>Access Denied</H1>
L:
L: You don't have permission to access "http://a810-bisweb.nyc.gov/bisweb/ResultsByNameServlet?" on this server.<P>
L: Reference #18.85600317.1438181595.a09a236f
L: </BODY>
L: </HTML>
For the sake of readability, here's roughly what that error page looks like:
Access Denied
You don't have permission to access "http://a810-bisweb.nyc.gov/bisweb/ResultsByNameServlet?" on this server.
Reference #18.85600317.1438181008.a0891486
Some things to note:
Does anyone have any idea what the issue might be? Many thanks.
EDIT: Something I forgot to mention: the script was going through multiple requests far faster than seemed feasible, so I thought it might somehow be connecting to the web server running on my local machine. However, I didn't see any requests in my local server's access logs that looked like they could have come from the script.
EDIT: @Alik suggested I rerun my local script with logging enabled, so here's that output:
URL: http://a810-bisweb.nyc.gov/bisweb/ResultsByNameServlet?bizname=s&licensetype=B
INFO:urllib3.connectionpool:Starting new HTTP connection (1): a810-bisweb.nyc.gov
DEBUG:urllib3.connectionpool:"GET /bisweb/ResultsByNameServlet?bizname=s&licensetype=B HTTP/1.1" 403 309
actual url: http://a810-bisweb.nyc.gov/bisweb/ResultsByNameServlet?bizname=s&licensetype=B
L: <HTML><HEAD>
L: <TITLE>Access Denied</TITLE>
L: </HEAD><BODY>
L: <H1>Access Denied</H1>
L:
L: You don't have permission to access "http://a810-bisweb.nyc.gov/bisweb/ResultsByNameServlet?" on this server.<P>
L: Reference #18.85600317.1438184686.a0f4b341
L: </BODY>
L: </HTML>
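For anyone wanting to reproduce that kind of log output, here is a minimal sketch of one way to turn on debug logging for the connection layer underneath requests. It is written for Python 3 and is an assumption about the setup, not necessarily what was run above; the logger name "requests.packages.urllib3" only matters for older requests releases that bundled urllib3.

```python
import logging

# Send log records to stderr and show everything from DEBUG upwards.
logging.basicConfig(level=logging.DEBUG)

# urllib3 is the connection layer under requests. Older requests releases
# vendored it under the logger name "requests.packages.urllib3", so both
# names are set here to cover either case.
for name in ("urllib3", "requests.packages.urllib3"):
    logging.getLogger(name).setLevel(logging.DEBUG)
```

With this in place before the first request, each request should log a line containing the method, path, and status code, like the "403 309" entry above.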
Upvotes: 5
Views: 5428
Reputation: 111
I encountered the same thing. The problem turned out to be that the website was blocking the python-requests User-Agent.
You can check which User-Agent you are sending by enabling debugging at the httplib level, as explained in this answer by @Yohann.
You can change the User-Agent that requests sends, as explained here by @birryree.
Changing the User-Agent to one that isn't blocked should fix the problem, if that's what is happening to you as well.
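Since the links did not survive, here is a rough Python 3 sketch of both steps. It builds a request without sending it, so nothing touches the network; the User-Agent value and query parameters are taken from the question, and the use of a Session is just one way to do it, not necessarily what the linked answers show.

```python
import http.client

import requests

# Print raw request/response lines (including the headers actually sent)
# to stdout for any connection made after this point.
http.client.HTTPConnection.debuglevel = 1

# The User-Agent requests falls back to when none is supplied,
# in the form "python-requests/<version>".
print("default:", requests.utils.default_user_agent())

# Override the User-Agent for every request made through this session.
session = requests.Session()
session.headers.update({"User-Agent": "nyc_contractors.py"})

# Build (but do not send) the request to inspect what would go out.
request = requests.Request(
    "GET",
    "http://a810-bisweb.nyc.gov/bisweb/ResultsByNameServlet",
    params={"bizname": "a", "licensetype": "C"},
)
prepared = session.prepare_request(request)
print("url:", prepared.url)
print("sent as:", prepared.headers["User-Agent"])
```

To actually send it, `session.send(prepared)` would issue the request, and the debug output should then show the outgoing headers.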
Upvotes: 3
Reputation: 3330
Okay, this is the stupidest problem, and I honestly don't understand why it happens, but I solved it. The script in my post works because I accidentally changed the User-Agent there to "nyc_contractors.py". The version I was actually running failed because it used "nyc_contractor_scraper.py", and for some reason the server doesn't like that specific User-Agent (maybe it blacklists "scraper"? Who knows).
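To illustrate the kind of filtering this might be, here is a small self-contained Python 3 simulation using only the standard library. The rule it implements (reject any User-Agent containing "scraper") is purely a guess; the real server's logic is unknown, and `FilteringHandler`, `BLOCKED_SUBSTRING`, and `status_for` are made-up names for this sketch.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical rule: the real server's filter is unknown; "scraper" is a guess.
BLOCKED_SUBSTRING = "scraper"


class FilteringHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Reject the request when the User-Agent contains the blocked word.
        agent = self.headers.get("User-Agent", "")
        if BLOCKED_SUBSTRING in agent.lower():
            status, body = 403, b"Access Denied"
        else:
            status, body = 200, b"OK"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Ignore any proxy settings so the request really goes to localhost.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def status_for(agent, port):
    """Return the HTTP status the local test server gives this User-Agent."""
    req = urllib.request.Request(
        "http://127.0.0.1:%d/" % port, headers={"User-Agent": agent}
    )
    try:
        with opener.open(req) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


# Port 0 asks the OS for any free port.
server = HTTPServer(("127.0.0.1", 0), FilteringHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

results = {
    agent: status_for(agent, port)
    for agent in ("nyc_contractors.py", "nyc_contractor_scraper.py")
}
print(results)

server.shutdown()
server.server_close()
```

Against this simulated server, the first User-Agent gets a 200 and the second a 403, which matches the behaviour I saw. Trying a few different User-Agent strings against the real site in the same way would be one way to narrow down what it actually objects to.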
Upvotes: 3