Reputation:
I found a site that fixes my mojibake using the Python package ftfy. I tried reproducing the steps it lists, but the site seems to pre-convert the string before running those steps.
The string I am trying to fix is EvðŸ’👸ðŸ», although the site seems to pre-convert it to EvðŸâ\x80\x99Â\x9dðŸâ\x80\x98¸ðŸÂ\x8f» before attempting to fix it with the same steps I use below.
My question is: how can I get my string into the same state as the site does before running the fix_broken_unicode function, to hopefully avoid the error I am facing?
When running my script (probably because I am not pre-converting the string), I receive:
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 3-4: ordinal not in range(256)
The source code for the website is at https://github.com/simonw/ftfy-web/blob/master/ftfy_app.py, but because I am primarily a C++ developer I can't follow it.
My script:
import ftfy.bad_codecs
def fix_broken_unicode(string):
    string = string.encode('latin-1')
    string = string.decode('utf-8')
    string = string.encode('sloppy-windows-1252')
    string = string.decode('utf-8')
    return string
print(fix_broken_unicode("EvðŸ’👸ðŸ»"))
Updates since answer:
My input: "EvðŸ’👸ðŸ»", expected outcome: Ev💝👸🏻
Upvotes: 0
Views: 9717
Reputation: 177685
Your data string might be missing some non-printable characters:
>>> s = 'EvðŸ’\x9d👸ðŸ\x8f»' # \x9d and \x8f aren't printable.
>>> print(s) # This looks like your mojibake.
EvðŸ’👸ðŸ»
>>> s.encode('mbcs').decode('utf8')
'Ev💝👸🏻'
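If you want to see the hidden characters for yourself, here is a small sketch. It assumes your data really is the string above (I have spelled it with escapes so nothing gets lost in copy/paste). It prints the escaped form and lists which characters fall outside Latin-1, which is why your first `.encode('latin-1')` step fails.

```python
# The mojibake string written with explicit escapes (assumed to match your data).
s = 'Ev\xf0\u0178\u2019\x9d\xf0\u0178\u2018\xb8\xf0\u0178\x8f\xbb'

# ascii() escapes every non-ASCII character, so \x9d and \x8f become visible.
print(ascii(s))

# Characters above U+00FF cannot be encoded as latin-1.
not_latin1 = [(i, hex(ord(ch))) for i, ch in enumerate(s) if ord(ch) > 0xFF]
print(not_latin1)

try:
    s.encode('latin-1')
except UnicodeEncodeError as exc:
    first_bad = exc.start  # index of the first character latin-1 rejects
    print(exc)
```

The first offending characters are at indexes 3 and 4 (U+0178 and U+2019), which lines up with the "position 3-4" in your traceback.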
Note that Python's mbcs codec is only available on Windows and corresponds to the Windows default ANSI codec. It matches "sloppy-windows-1252" only if Windows-1252 is the default ANSI codec (US- and Western European-localized versions of Windows), which is what I am running.
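Since mbcs exists only on Windows, here is a rough portable sketch of the same idea using only the standard library. The helper names `encode_sloppy_cp1252` and `fix_mojibake` are my own, not part of ftfy, and the fallback rule (pass the byte through unchanged when cp1252 has no mapping for it) is my approximation of what a "sloppy" codec does.

```python
def encode_sloppy_cp1252(text):
    """Encode as cp1252, passing U+0000..U+00FF through as raw bytes
    when cp1252 itself has no mapping (e.g. \\x81, \\x8d, \\x8f, \\x90, \\x9d)."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode('cp1252')
        except UnicodeEncodeError:
            if ord(ch) < 256:
                out.append(ord(ch))  # hypothetical "sloppy" fallback
            else:
                raise  # genuinely not representable; give up
    return bytes(out)


def fix_mojibake(text):
    """Undo one round of UTF-8 bytes having been decoded as Windows-1252."""
    return encode_sloppy_cp1252(text).decode('utf-8')


# Same string as above, with the non-printable characters included.
s = 'Ev\xf0\u0178\u2019\x9d\xf0\u0178\u2018\xb8\xf0\u0178\x8f\xbb'
fixed = fix_mojibake(s)
print(fixed)
```

This only works when the non-printable characters are still present in the string. If they were stripped somewhere along the way, the final UTF-8 decode will fail.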
The other possibility is that your original UTF-8 data was decoded with .decode('cp1252', errors='ignore'). If so, the two undecodable bytes were lost and the string isn't reversible.
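To illustrate that second case, here is a minimal sketch. It assumes the original text was the expected output from your question, and it shows that decoding with errors='ignore' silently drops the two bytes cp1252 cannot map, after which the reverse trip no longer yields valid UTF-8.

```python
original = 'Ev\U0001F49D\U0001F478\U0001F3FB'  # assumed original text

# Bytes 0x9d and 0x8f have no cp1252 mapping, so 'ignore' discards them.
lossy = original.encode('utf-8').decode('cp1252', errors='ignore')
print(ascii(lossy), len(lossy))

try:
    lossy.encode('cp1252').decode('utf-8')
    reversible = True
except UnicodeDecodeError as exc:
    reversible = False  # the byte sequence is now truncated mid-character
    print(exc)
```

In that situation no codec trick can recover the emoji, because the information is simply gone. You would need to go back to the original bytes.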
Upvotes: 1