Connor McLaughlin

Reputation: 3

urllib redirect error

I'm trying to scrape tables using urllib and BeautifulSoup, and I get the error:

"urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Found"

I've heard this can happen when the site requires cookies, but I still get the error with my second attempt:

import urllib.request
from bs4 import BeautifulSoup
import re

opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
file = opener.open(testURL).read().decode()
soup = BeautifulSoup(file)
tables = soup.find_all('tr',{'style': re.compile("color:#4A3C8C")})
print(tables)

Upvotes: 0

Views: 666

Answers (1)

t.m.adam

Reputation: 15376

A few suggestions:

  • Use HTTPCookieProcessor if you must handle cookies.
  • You don't have to send a custom User-Agent, but if you want to simulate Mozilla you'll have to use its full user-agent string. This site won't accept just 'Mozilla/5.0' and will keep redirecting.
  • You can catch such failures by handling urllib.error.HTTPError.

import urllib.request
import urllib.error
from bs4 import BeautifulSoup

# Cookie-aware opener with a full browser user-agent string
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())
user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:54.0) Gecko/20100101 Firefox/54.0'
opener.addheaders = [('User-Agent', user_agent)]

try:
    response = opener.open(testURL)
except urllib.error.HTTPError as e:
    print(e)
except Exception as e:
    print(e)
else:
    file = response.read().decode()
    soup = BeautifulSoup(file, 'html.parser')
    ... etc ...
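To see why HTTPCookieProcessor breaks the redirect loop, here is a minimal, self-contained sketch (the local test server, the cookie name, and the table markup are all made up for illustration): the server sets a cookie and redirects until the client sends that cookie back. A plain opener would redirect forever, but a cookie-aware opener resolves after one hop.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class CookieGate(BaseHTTPRequestHandler):
    """Redirect with Set-Cookie until the client echoes the cookie back."""
    def do_GET(self):
        if 'Cookie' in self.headers:
            # Cookie present: serve the page.
            body = b'<table><tr style="color:#4A3C8C"><td>ok</td></tr></table>'
            self.send_response(200)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # No cookie yet: set one and redirect back to the same path.
            self.send_response(302)
            self.send_header('Set-Cookie', 'session=abc')
            self.send_header('Location', '/')
            self.end_headers()

    def log_message(self, *args):
        pass  # silence per-request logging

server = HTTPServer(('127.0.0.1', 0), CookieGate)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The cookie processor stores the Set-Cookie from the 302 response
# and replays it on the follow-up request, so the redirect resolves.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 6.1; rv:54.0) Gecko/20100101 Firefox/54.0')]

html = opener.open('http://127.0.0.1:%d/' % server.server_port).read().decode()
server.shutdown()
```

Without the HTTPCookieProcessor, every request arrives cookie-less, the server answers 302 each time, and urllib eventually raises the same "infinite loop" HTTPError seen in the question.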

Upvotes: 1
