I cannot open a website that exists

Question

I am getting an error that makes me believe my program is unable to find a website I know exists. The website is

https://www.transfermarkt.com/marco-reus/verletzungen/spieler/35207

My code looks like

from urllib import request as u_r

def strip_webite():

  with u_r.urlopen("https://www.transfermarkt.com/marco-reus/verletzungen/spieler/35207") as f:
      pass

if __name__ == "__main__":
  strip_webite()

And the error I get is

  File "database_management.py", line 19, in <module>
    strip_webite()
  File "database_management.py", line 15, in strip_webite
    with u_r.urlopen("https://www.transfermarkt.com/marco-reus/verletzungen/spieler/35207") as f:
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Adam Barnes · Accepted Answer

It looks like Transfermarkt is blocking requests that carry the default User-Agent string sent by Python's urllib library, though its robots.txt doesn't mention anything about that.

This seems to imply they don't mind us scraping them, but they'd prefer we announce who we are.
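For reference, the default User-Agent that urllib attaches can be inspected without making any network request. This is a minimal sketch; the helper name default_user_agent is hypothetical and only there for illustration.

```python
from urllib import request as u_r


def default_user_agent():
    """Return the User-Agent string urllib sends when none is set explicitly."""
    # build_opener() creates an opener carrying urllib's default headers;
    # nothing is sent over the network here.
    opener = u_r.build_opener()
    # addheaders is a list of (name, value) tuples.
    headers = dict(opener.addheaders)
    return headers["User-agent"]


# Prints a string of the form "Python-urllib/<major>.<minor>"
print(default_user_agent())
```

A server can easily recognise that value as coming from a script, which fits the behaviour seen in the question. Replacing it with a descriptive string is how a client announces who it is.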

To do so with urllib, do the following:

from urllib import request as u_r

def strip_webite():

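  # Build the request explicitly so a custom header can be attached to it
  request = u_r.Request("https://www.transfermarkt.com/marco-reus/verletzungen/spieler/35207")
  # Identify the client instead of sending urllib's default User-Agent
  request.add_header('User-Agent', 'my-cool-app')
  with u_r.urlopen(request) as f:
      pass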

if __name__ == "__main__":
  strip_webite()
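If it helps to see what the server actually answers, the same idea can be wrapped so that HTTP errors are caught rather than raised. This is only a sketch under some assumptions: the names build_request and fetch and the 10-second timeout are hypothetical choices, and 'my-cool-app' is the placeholder User-Agent from the code above. Whether the site accepts a given User-Agent is up to the site and may change.

```python
from urllib import request as u_r
from urllib.error import HTTPError


def build_request(url, user_agent="my-cool-app"):
    """Create a Request that carries an explicit User-Agent header."""
    return u_r.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent="my-cool-app", timeout=10):
    """Return (status_code, body_bytes), including for HTTP error responses."""
    req = build_request(url, user_agent)
    try:
        with u_r.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read()
    except HTTPError as err:
        # HTTPError doubles as a response object, so the status code and
        # body of a 4xx/5xx reply are still available for inspection.
        return err.code, err.read()
```

Calling fetch() with the URL from the question and printing the returned status code shows whether the custom User-Agent changes the response. Connection-level failures (urllib.error.URLError) are deliberately not caught here and will still propagate.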
