Hayden Schiff

Reputation: 3330

Python requests gets an error page for a URL that should work fine

I'm trying to use a Python script to scrape a handful of pages on a government website. The script requests a URL that loads a normal webpage in my web browser, but for some reason it gets an "Access Denied" page instead of the expected page.

Additionally, this "Access Denied" error is unlike any I have ever seen on this website; I can't reproduce it by any means other than my Python script.

Here is a stripped down version of my script (it's rather big, so I cut out bits I don't think are relevant):

import requests

headers = {
    'Accept': "*/*",
    'User-Agent': "nyc_contractors.py",
    'X-Love': "hey sysadmin! you're awesome! <3"
}

print "and we're off!"

qLicensetype="C"
qBizname = "a"

baseUrl = "http://a810-bisweb.nyc.gov/bisweb/ResultsByNameServlet?bizname="+qBizname+"&licensetype="+qLicensetype
nextUrl = baseUrl

while nextUrl != None:

    print
    print "URL:", nextUrl

    r = requests.get(nextUrl, headers=headers)
    nextUrl = None # kill the url (if there's a next page, we'll restore the url later)
    print "actual url:",r.url

    lines = r.text.splitlines()

    for line in lines:
        print "L:", line

And here is the log output from running that script:

and we're off!

URL: http://a810-bisweb.nyc.gov/bisweb/ResultsByNameServlet?bizname=a&licensetype=C
actual url: http://a810-bisweb.nyc.gov/bisweb/ResultsByNameServlet?bizname=a&licensetype=C
L: <HTML><HEAD>
L: <TITLE>Access Denied</TITLE>
L: </HEAD><BODY>
L: <H1>Access Denied</H1>
L:  
L: You don't have permission to access "http&#58;&#47;&#47;a810&#45;bisweb&#46;nyc&#46;gov&#47;bisweb&#47;ResultsByNameServlet&#63;" on this server.<P>
L: Reference&#32;&#35;18&#46;85600317&#46;1438181595&#46;a09a236f
L: </BODY>
L: </HTML>

For the sake of readability, here's roughly what that error page looks like:

Access Denied

You don't have permission to access "http://a810-bisweb.nyc.gov/bisweb/ResultsByNameServlet?" on this server.

Reference #18.85600317.1438181008.a0891486

Does anyone have any idea what the issue might be? Many thanks.

EDIT: Something I forgot to mention: the script was getting through multiple requests far faster than seemed feasible, so I wondered whether it was somehow hitting the web server running on my local machine instead, but I didn't see any requests that could plausibly be the source in my local server's access logs.

EDIT: @Alik suggested I rerun my local script with logging enabled, so here's that output:

URL: http://a810-bisweb.nyc.gov/bisweb/ResultsByNameServlet?bizname=s&licensetype=B
INFO:urllib3.connectionpool:Starting new HTTP connection (1): a810-bisweb.nyc.gov
DEBUG:urllib3.connectionpool:"GET /bisweb/ResultsByNameServlet?bizname=s&licensetype=B HTTP/1.1" 403 309
actual url: http://a810-bisweb.nyc.gov/bisweb/ResultsByNameServlet?bizname=s&licensetype=B
L: <HTML><HEAD>
L: <TITLE>Access Denied</TITLE>
L: </HEAD><BODY>
L: <H1>Access Denied</H1>
L:  
L: You don't have permission to access "http&#58;&#47;&#47;a810&#45;bisweb&#46;nyc&#46;gov&#47;bisweb&#47;ResultsByNameServlet&#63;" on this server.<P>
L: Reference&#32;&#35;18&#46;85600317&#46;1438184686&#46;a0f4b341
L: </BODY>
L: </HTML>
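
(For reference, those INFO/DEBUG lines come from urllib3 via the standard logging module; a minimal sketch of how to enable them, not necessarily the exact call I used:)

import logging
import requests

# Enable root-level DEBUG logging so urllib3's connection messages
# ("Starting new HTTP connection", the GET line and status code) are printed.
logging.basicConfig(level=logging.DEBUG)

r = requests.get("http://a810-bisweb.nyc.gov/bisweb/ResultsByNameServlet"
                 "?bizname=s&licensetype=B")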

Upvotes: 5

Views: 5428

Answers (2)

AntCas

Reputation: 111

I encountered this same thing. The problem turned out to be that the website was blocking the default python-requests User-Agent.

You can check what user agent you're actually sending by enabling debugging at the httplib level, as explained in this answer by @Yohann.
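
For example, something along these lines (a sketch of that technique, not code from the question; on Python 3 the module is http.client rather than httplib):

import logging
import httplib  # on Python 3: from http import client as httplib

import requests

# With debuglevel set, httplib prints the raw request it sends,
# including the User-Agent header, so you can see exactly what the
# server receives.
httplib.HTTPConnection.debuglevel = 1
logging.basicConfig(level=logging.DEBUG)

requests.get("http://a810-bisweb.nyc.gov/bisweb/ResultsByNameServlet"
             "?bizname=a&licensetype=C")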

You can change the user agent used by requests as explained here by @birryree.
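
In short, you pass your own User-Agent header to requests; a minimal sketch (any non-blocked string should do):

import requests

# Override the default "python-requests/x.y.z" User-Agent with a
# string the server doesn't block.
headers = {"User-Agent": "Mozilla/5.0 (compatible; my-script)"}

r = requests.get("http://a810-bisweb.nyc.gov/bisweb/ResultsByNameServlet"
                 "?bizname=a&licensetype=C",
                 headers=headers)
print(r.status_code)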

Changing the user agent to one which isn't blocked should fix the problem if that's what's happening to you as well.

Upvotes: 3

Hayden Schiff

Reputation: 3330

Okay, this is the stupidest problem, and I literally don't understand why it happens, but I solved it. The script in my post works because I accidentally changed the User-Agent there to "nyc_contractors.py". The version I was actually running fails because its User-Agent was "nyc_contractor_scraper.py", and for some reason the server doesn't like that specific user agent (maybe it blacklists anything containing "scraper"? who knows).
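
A quick way to see the difference is to send the same request with both strings and compare the status codes (a minimal check, separate from my actual script):

import requests

url = ("http://a810-bisweb.nyc.gov/bisweb/ResultsByNameServlet"
       "?bizname=a&licensetype=C")

# Identical requests; only the User-Agent value differs.
for ua in ("nyc_contractors.py", "nyc_contractor_scraper.py"):
    r = requests.get(url, headers={"User-Agent": ua})
    print("%s -> %s" % (ua, r.status_code))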

Upvotes: 3
