aquagremlin

Reputation: 3549

Web scraping error with Python 3 on only some websites

This code works for websites like Google and Yahoo and prints 'good':

import urllib.error as ure
import urllib.request as ur

#url="http://www.evga.com"
#url="http://www.asus.com/us/"
url="http://www.google.com"

try:
    conn = ur.urlopen(url)
except ure.HTTPError as e:
    # Return code error (e.g. 404, 501, ...)
    # ...
    print('HTTPError: {}'.format(e.code))
except ure.URLError as e:
    # Not an HTTP-specific error (e.g. connection refused)
    # ...
    print('URLError: {}'.format(e.reason))
else:
    # 200
    # ...
    print('good')

But for ASUS it gives a 403 error, and for EVGA it gives no response at all. How do I troubleshoot this problem?

Upvotes: 0

Views: 41

Answers (1)

Alberto Castillo

Reputation: 268

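You're hitting a classic headers problem: some sites reject or ignore requests that don't carry a browser-like User-Agent header, and urllib doesn't send one by default. You can set headers with urllib too, but I don't think it's the best tool for scraping because you end up handling a lot of details yourself. Trust me, urllib is a mess...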
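
To see the headers problem concretely: unless told otherwise, urllib identifies itself with a User-Agent of the form Python-urllib/3.x, and some sites answer that with a 403. Below is a minimal sketch that prints that default and builds a request with a browser-like value instead. The helper names default_user_agent and build_request are made up for illustration, and the User-Agent string is only an example; whether a particular site accepts it is an assumption.

```python
import urllib.request as ur

# Example browser-like value (an assumption; any current browser string should do).
BROWSER_UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/83.0.4103.97 Safari/537.36")


def default_user_agent():
    """Return the User-Agent urllib sends when none is set."""
    # A fresh opener keeps urllib's default headers in .addheaders.
    return dict(ur.build_opener().addheaders)["User-agent"]


def build_request(url, user_agent=BROWSER_UA):
    """Build a Request object that carries a custom User-Agent header."""
    return ur.Request(url, headers={"User-Agent": user_agent})


print(default_user_agent())  # something like Python-urllib/3.x
```

Passing build_request(url) to ur.urlopen(..., timeout=10) in place of the bare URL sends the browser-like header, and the timeout should make a silent server raise an error instead of hanging.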

For web scraping I recommend either requests or selenium. The first one is a good start.

Let me share a requests version of your code:

import requests
url="http://www.evga.com"
#url="http://www.asus.com/us/"
#url="http://www.google.com"

headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}
r = requests.get(url, headers=headers)
print(r.status_code)

Yields:

200

I noticed "http://www.evga.com" is a troublemaker, but once you send headers you'll have everything under control.
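
For the troubleshooting side of the question, it may help to add a timeout and catch the different failure modes separately, much like the try/except in the original code. This is only a sketch: check_url is a hypothetical helper name, the 10-second timeout is an arbitrary choice, and the User-Agent value is just an example.

```python
import requests

# Example browser-like header (an assumption, not a required value).
HEADERS = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) "
                         "AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/83.0.4103.97 Safari/537.36"}


def check_url(url, timeout=10):
    """Return a short description of what happened when fetching url."""
    try:
        r = requests.get(url, headers=HEADERS, timeout=timeout)
    except requests.exceptions.Timeout:
        # The server did not connect or reply within the time limit.
        return "timed out after {} s".format(timeout)
    except requests.exceptions.ConnectionError as e:
        # DNS failure, refused connection, and similar network errors.
        return "connection failed: {}".format(e)
    except requests.exceptions.RequestException as e:
        # Anything else requests can raise, e.g. a malformed URL.
        return "request failed: {}".format(e)
    # A response arrived; report its status code (200, 403, ...).
    return "HTTP {}".format(r.status_code)
```

Calling check_url("http://www.evga.com") should then help tell apart a site that answered, one that refused the connection, and one that never replied.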

More info about requests: https://requests.readthedocs.io/en/master/

Upvotes: 1
