user5405648

Python: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 3-4: ordinal not in range(256)

I found a site that fixes my mojibake using the Python package ftfy. I tried reproducing the steps it gives, but the site seems to pre-convert the string before running those steps.

The string I am trying to fix is EvðŸ’👸ðŸ», although the site seems to pre-convert it to EvðŸâ\x80\x99Â\x9dðŸâ\x80\x98¸ðŸÂ\x8f» before attempting to fix it with the same steps as I am below.

My question is: how can I get my string into the same state as the site does before running the fix_broken_unicode function, to hopefully avoid the error I am facing?

When I run my script (probably because I am not pre-converting), I receive:

UnicodeEncodeError: 'latin-1' codec can't encode characters in position 3-4: ordinal not in range(256)

The source code for the mentioned website can be found at https://github.com/simonw/ftfy-web/blob/master/ftfy_app.py, but as I am primarily a C++ developer, I can't follow it.

My script:

import ftfy.bad_codecs 

def fix_broken_unicode(string):
    string = string.encode('latin-1')
    string = string.decode('utf-8')
    string = string.encode('sloppy-windows-1252')
    string = string.decode('utf-8')
    return string
    
print(fix_broken_unicode("EvðŸ’👸ðŸ»"))

Updates since answer:

My input: "EvðŸ’👸ðŸ»", expected outcome: Ev💝👸🏻

Upvotes: 0

Views: 9717

Answers (1)

Mark Tolonen

Reputation: 177685

Your data string might be missing some non-printable characters:

>>> s = 'EvðŸ’\x9d👸ðŸ\x8f»'  # \x9d and \x8f aren't printable.
>>> print(s)                    # This looks like your mojibake.
EvðŸ’👸ðŸ»
>>> s.encode('mbcs').decode('utf8')
'Ev💝👸🏻'

Note that Python's mbcs codec (available only on Windows) corresponds to the Windows default ANSI code page. It matches "sloppy-windows-1252" only if Windows-1252 is the default ANSI code page (US- and Western European-localized versions of Windows), which is what I am running.
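
Because mbcs is not available off Windows, the same repair can be sketched with the standard library alone. This is a minimal sketch, not ftfy's implementation: the helper names sloppy_cp1252_encode and fix_mojibake are hypothetical, and it assumes the mojibake came from UTF-8 bytes decoded as Windows-1252, with the bytes Windows-1252 leaves undefined passed through as their Latin-1 code points. The broken string is written with escapes so the non-printable characters are visible, and it assumes the middle emoji was mangled the same way as the other two (as the site's pre-converted form suggests).

```python
def sloppy_cp1252_encode(text: str) -> bytes:
    """Encode as Windows-1252, but pass U+0080..U+00FF through as raw bytes
    when Windows-1252 has no mapping for them (e.g. U+009D, U+008F)."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            if ord(ch) > 0xFF:
                raise  # not this kind of mojibake; let the caller see the error
            out.append(ord(ch))
    return bytes(out)


def fix_mojibake(text: str) -> str:
    # Recover the original bytes, then decode them as the UTF-8 they really were.
    return sloppy_cp1252_encode(text).decode("utf-8")


# Assumed input: "Ev" followed by three emoji, each mis-decoded as Windows-1252.
broken = (
    "Ev"
    "\u00f0\u0178\u2019\u009d"  # bytes F0 9F 92 9D
    "\u00f0\u0178\u2018\u00b8"  # bytes F0 9F 91 B8
    "\u00f0\u0178\u008f\u00bb"  # bytes F0 9F 8F BB
)
fixed = fix_mojibake(broken)
print(ascii(fixed))  # ascii() avoids console-encoding problems when printing emoji
```

If ftfy is installed, importing ftfy.bad_codecs and calling broken.encode('sloppy-windows-1252').decode('utf-8') should behave the same way for this input.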

The other possibility is that your original UTF-8 data was decoded with .decode('cp1252', errors='ignore'). In that case the two bytes that Windows-1252 leaves undefined (0x9d and 0x8f) were dropped, and the string can't be reversed.
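
The lossy case can be illustrated with a short sketch. It assumes the expected text from the question (Ev followed by the three emoji) as the starting point; the variable names are only for illustration.

```python
# Start from the expected text and reproduce the lossy decode.
original = "Ev\U0001F49D\U0001F478\U0001F3FB"
raw = original.encode("utf-8")

# errors="ignore" silently drops bytes that cp1252 cannot decode.
lossy = raw.decode("cp1252", errors="ignore")
dropped = len(raw) - len(lossy)

# Try to reverse it: the missing bytes leave invalid UTF-8 behind.
try:
    lossy.encode("cp1252").decode("utf-8")
    recoverable = True
except UnicodeDecodeError:
    recoverable = False

print(dropped, recoverable)
```

Once those bytes are gone, no choice of codec can bring them back; the only fix is to go back to the original bytes and decode them as UTF-8.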

Upvotes: 1
