That1Guy

Reputation: 7233

How to handle redirects with python/urllib status code still 200?

I'm having trouble handling a certain redirect with Python. I'm requesting a page that apparently loads and then immediately redirects to ww1.www.com. I assume this is what's happening because every method I know of for inspecting the headers/status code returns results that look normal (status code 200, the expected host/referrer params, etc.).

Here is what I have:

from BeautifulSoup import BeautifulSoup
import urllib
import psycopg2
import psycopg2.extras

db = psycopg2.connect(
                     host = 'myIP',
                     database = 'myDATABASE',
                     user = 'myUSERNAME',
                     password = 'myPASSWORD'
                     )

cursor = db.cursor(cursor_factory = psycopg2.extras.RealDictCursor)
cursor.execute("SELECT info FROM table")

for row in cursor:
    url = 'http://www.website.com/' + row['info']
    file_pointer = urllib.urlopen(url)
    html_object = BeautifulSoup(file_pointer)

    if file_pointer.getcode() != 200:
        continue

The if statement should prevent any further code from executing when the status code is not 200. However, I get IndexErrors in code after this section, and when I investigate the URL that produces the error, I find that it redirects without ever giving me a 302 status code.

Any thoughts on why I would get a 200 status code response while still being redirected? (I've also tried the urllib2 and httplib equivalents.) Also, how can I prevent this from happening?
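One check I've considered adding (just a sketch; `was_redirected` is a helper name I made up): since urlopen follows 30x responses before returning, `getcode()` reports the status of the *final* page, but `geturl()` still exposes where the request actually landed, so comparing it against the URL I asked for would reveal a silent hop:

```python
# Sketch: compare the URL we requested with the URL urlopen ends up at.
# urlopen follows 30x redirects transparently, so getcode() reports the
# final page's status -- but geturl() still tells us where we landed.
try:
    from urlparse import urlparse          # Python 2
except ImportError:
    from urllib.parse import urlparse      # Python 3

def was_redirected(requested_url, final_url):
    """True if the final response came from a different host or path."""
    a, b = urlparse(requested_url), urlparse(final_url)
    return (a.netloc, a.path) != (b.netloc, b.path)

# Usage inside the loop (untested against the real site):
# file_pointer = urllib.urlopen(url)
# if was_redirected(url, file_pointer.geturl()):
#     continue   # e.g. the page silently bounced to ww1.www.com
```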

Upvotes: 1

Views: 2355

Answers (1)

Jon Clements

Reputation: 142216

One thing that doesn't look right

html_object = BeautifulSoup(file_pointer) should operate on the data returned by urlopen, not on the file handle itself - so html_object = BeautifulSoup(file_pointer.read()) is what's wanted here.

For debugging

Install requests if you haven't already - it's a great library for these kinds of things.

Then:

import requests

for row in cursor:
    page = requests.get('http://www.website.com/' + row['info'])
    for hist in page.history:
        print hist.status_code, hist.url

And see if that throws out anything that's puzzling...
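If you'd rather stay with the standard library, another option (a sketch; `RecordingRedirectHandler` is my own name for it) is to build an opener with a redirect handler that records each hop, so the intermediate 302 stays visible even though the final response reports 200 (urllib2 on Python 2, urllib.request on Python 3):

```python
# Sketch: a redirect handler that logs every 30x hop the opener follows.
try:
    import urllib2 as urlreq               # Python 2
except ImportError:
    import urllib.request as urlreq        # Python 3

class RecordingRedirectHandler(urlreq.HTTPRedirectHandler):
    """Records (status code, target url) for every redirect followed."""
    def __init__(self):
        self.hops = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        self.hops.append((code, newurl))
        # Delegate to the stock handler so the redirect is still followed.
        return urlreq.HTTPRedirectHandler.redirect_request(
            self, req, fp, code, msg, headers, newurl)

# Usage (untested against the real site):
# handler = RecordingRedirectHandler()
# opener = urlreq.build_opener(handler)
# response = opener.open(url)
# if handler.hops:                 # at least one 30x happened
#     print handler.hops           # e.g. [(302, 'http://ww1.www.com/...')]
```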

Upvotes: 2
