Ronit Malde

Reputation: 11

Errors Using Python urllib to read html

I am currently having trouble using urllib in Python to open links. My code is intended to take the link of an article (variable "url"), open the link (page = urlopen(url)), get the HTML from the website (html_bytes = page.read()), decode the HTML (variable "html"), and then print the decoded result.

Here is my code:


    from urllib.request import urlopen

    url = "https://www.wsj.com/articles/peloton-says-wait-times-are-down-to-pre-pandemic-levels-11620334234?mod=hp_lista_pos4"

    page = urlopen(url)
    html_bytes = page.read()
    html = html_bytes.decode("utf-8")
    print(html)

Here is my error:

File "c:/Users/Stras/VeraitasBot/urlopentest.py", line 5, in <module>
    page = urlopen(url)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\urllib\request.py", line 640, in http_response        
    response = self.parent.error(
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\urllib\request.py", line 649, in http_error_default   
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
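
For reference, the error is an urllib.error.HTTPError raised by urlopen itself, so it can be caught and inspected rather than crashing the script. A minimal sketch, reusing the url variable from my code above:

    from urllib.request import urlopen
    from urllib.error import HTTPError

    try:
        page = urlopen(url)
        html = page.read().decode("utf-8")
    except HTTPError as err:
        # err.code is the HTTP status (403 here); err.headers holds the response headers
        print(err.code, err.reason)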

This code is able to open most links and scrape the HTML from websites such as the New York Times, Fox, and CNN, but I always get that error when I try to pull the HTML from websites such as WSJ (as shown in the example above).

Does anybody know a method to consistently scrape info from all websites, or how to fix this error? Thanks
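
From what I understand, the 403 may come from the server rejecting the default Python-urllib User-Agent, so one possible workaround would be sending a browser-like header via urllib.request.Request. A rough sketch of what I mean (the User-Agent string is just an example, and paywalled sites like WSJ may still refuse the request):

    from urllib.request import Request, urlopen

    # Example browser-like User-Agent string; any common browser UA should work the same way
    req = Request(
        url,
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
    )
    page = urlopen(req)
    html = page.read().decode("utf-8")
    print(html)

Would this be the right approach, or is there a more reliable way?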

Upvotes: 1

Views: 142

Answers (0)
