user9682383

Change encoding for locally stored .html files downloaded with urllib.request.urlretrieve()

I used the following python code to save an html file to local storage:

import urllib.request
url = "some_url.html"
urllib.request.urlretrieve(url, 'save/to/path')

This successfully saves the file with a .html extension. When I attempt to open the file with:

html_doc = open('save/to/path/some_url.html', 'r')

I get the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 36255: ordinal not in range(128)

I think this means I am attempting to read a UTF-8 file with an ASCII codec. I attempted the solution found at:

Convert Unicode to ASCII without errors in Python

But this, as well as other solutions I have found, only seem to work for encoding the file for immediate viewing and not saved files. I cannot find one that works for altering the encoding of a locally stored file.

Upvotes: 0

Views: 259

Answers (1)

lenz

Reputation: 5817

The open() function has an optional encoding parameter. Its default is platform dependent, and in your case it apparently defaults to ASCII, which is why the error message mentions the 'ascii' codec.

If you know the correct codec (e.g. from an HTTP header), you can specify it:

html_doc = open('path/to/file.html', 'r', encoding='cp1252')
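If the server does send a charset in its Content-Type header, something like the following could pick it up at download time. This is only a sketch: charset_from_headers and fetch_text are made-up helper names, and the UTF-8 fallback is my assumption, not something the server guarantees.

```python
import urllib.request
from email.message import Message


def charset_from_headers(headers: Message, default: str = "utf-8") -> str:
    """Return the charset named in the Content-Type header, or a default.

    The default of UTF-8 is an assumption for pages that declare nothing.
    """
    return headers.get_content_charset() or default


def fetch_text(url: str) -> str:
    """Download a page and decode it with the charset the server declared."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        encoding = charset_from_headers(response.headers)
    # errors="replace" keeps going if the declared charset turns out to be wrong.
    return raw.decode(encoding, errors="replace")
```

urlretrieve() also returns the response headers as the second item of its return value, so the same helper should work on those if you prefer to keep saving the file first.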

If you don't know it, chances are that it is written in the file. You can open the file in binary mode:

html_doc = open('path/to/file.html', 'rb')

and then try to find an encoding declaration and decode the whole thing in memory.
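For illustration, a rough version of that approach might look like this. The helper names are hypothetical, the regex is a simplification of what browsers actually do to detect a charset, and falling back to UTF-8 is an assumption.

```python
import re

# Simplified pattern for declarations such as <meta charset="cp1252"> or
# <meta http-equiv="Content-Type" content="text/html; charset=cp1252">.
_CHARSET_RE = re.compile(
    rb'''charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)''', re.IGNORECASE
)


def sniff_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Look for a charset declaration near the start of the document."""
    match = _CHARSET_RE.search(raw[:4096])
    if match is None:
        return default
    return match.group(1).decode("ascii")


def decode_html(raw: bytes) -> str:
    """Decode the bytes with the declared charset, falling back to UTF-8."""
    encoding = sniff_encoding(raw)
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or a wrong declaration: decode leniently instead.
        return raw.decode("utf-8", errors="replace")
```

You would feed it the bytes from the binary-mode file, e.g. decode_html(html_doc.read()).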

However, don't do that. There's not much use in opening and processing HTML like a text file. You should use an HTML parser to walk through the document tree and extract whatever you need. Python's standard library has one (the html.parser module), but you might find Beautiful Soup easier to use.
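As a small example of the parser route using only the standard library, here is a sketch that collects link targets. LinkCollector and collect_links are names I made up for this, and what you actually extract will depend on your document.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(html_text):
    """Parse already-decoded HTML text and return the link targets in order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

Note that html.parser works on str, so the file still has to be decoded with the right codec first. Beautiful Soup accepts raw bytes and tries to detect the encoding for you.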

Upvotes: 1
