Reputation:
I used the following Python code to save an HTML file to local storage:
url = "some_url.html"
urllib.request.urlretrieve(url, 'save/to/path')
This successfully saves the file with a .html extension. When I attempt to open the file with:
html_doc = open('save/to/path/some_url.html', 'r')
I get the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 36255: ordinal not in range(128)
I think this means I am attempting to read a UTF-8 file with an ASCII codec. I attempted the solution found at:
Convert Unicode to ASCII without errors in Python
But this, like the other solutions I have found, only seems to work for encoding text for immediate viewing, not for saved files. I cannot find one that works for altering the encoding of a locally stored file.
Upvotes: 0
Views: 259
Reputation: 5817
The open() function has an optional encoding parameter.
Its default is platform dependent; judging by the error message, in your case it apparently defaults to ASCII.
If you know the correct codec (e.g. from an HTTP header), you can specify it:
html_doc = open('path/to/file.html', 'r', encoding='cp1252')
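If the goal is to change the encoding of the stored file itself, one option is to read it with the source codec and write it back with the target codec. The following is only a minimal sketch: reencode_file is a hypothetical helper, the sample content is made up, and cp1252 is assumed as the source encoding purely for illustration.

```python
import os
import tempfile


def reencode_file(path, src_encoding, dst_encoding="utf-8"):
    """Read `path` as src_encoding and rewrite it in place as dst_encoding.

    Hypothetical helper for illustration; newline="" keeps the original
    line endings untouched.
    """
    with open(path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
    return text


# Made-up sample file containing byte 0xe5 ("å" in cp1252).
fd, demo_path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as f:
    f.write("<p>Blåbær</p>".encode("cp1252"))

reencode_file(demo_path, "cp1252")

# The file on disk is now UTF-8 encoded.
with open(demo_path, "rb") as f:
    utf8_bytes = f.read()
os.remove(demo_path)
```

After this, the file can be opened with encoding='utf-8' regardless of the platform default.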
If you don't know it, chances are that it is written in the file. You can open the file in binary mode:
html_doc = open('path/to/file.html', 'rb')
and then try to find an encoding declaration and decode the whole thing in memory.
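A rough sketch of that idea, assuming the page declares its encoding in a meta tag near the top of the document. sniff_charset is a hypothetical helper, the sample bytes are made up, and the regular expression is a simplification rather than a full implementation of the HTML encoding-sniffing rules:

```python
import re


def sniff_charset(raw, default="utf-8"):
    """Return the charset named in a <meta> tag, or `default` if none is found.

    Hypothetical helper: only the first 1024 bytes are inspected, and both
    <meta charset="..."> and the older http-equiv form are matched.
    """
    match = re.search(
        rb'<meta[^>]*charset\s*=\s*["\']?([\w-]+)',
        raw[:1024],
        re.IGNORECASE,
    )
    return match.group(1).decode("ascii") if match else default


# Made-up sample standing in for the bytes read from a file opened with 'rb'.
raw = b'<html><head><meta charset="windows-1252"></head><body>\xe5</body></html>'

encoding = sniff_charset(raw)
text = raw.decode(encoding)
```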
However, don't do that. There's not much use in opening and processing HTML like a text file. You should use an HTML parser to walk through the document tree and extract whatever you need. Python's standard library has one, but you might find Beautiful Soup easier to use.
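As an illustration of the parser approach, here is a small sketch using the standard library's html.parser module. LinkCollector and the sample markup are made up for this example; collecting link targets is just one thing you might want to extract.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical example parser: collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Made-up sample document; in practice this would be the decoded file content.
sample = '<html><body><a href="/one">One</a> <a href="/two">Två</a></body></html>'

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

Note that HTMLParser.feed() expects a str, so the file still has to be decoded first. Beautiful Soup also accepts raw bytes and tries to detect the encoding itself, which sidesteps the original problem.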
Upvotes: 1