wim

Reputation: 363304

Unbaking mojibake

When you have incorrectly decoded characters, how can you identify likely candidates for the original string?

Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png

I know for a fact that this image filename should have been some Japanese characters. I've tried various combinations of urllib quoting/unquoting and encoding/decoding as ISO-8859-1 and UTF-8, but I haven't been able to unmunge it and recover the original filename.

Is the corruption reversible?

Upvotes: 4

Views: 4549

Answers (2)

Marco Trevisan

Reputation: 1

My first guess would've been that your gibberish was Shift JIS bytes mistakenly decoded as IBM437. Personally, I'd use this website here in the future (be sure to push the Swap button after encoding but before decoding). (I used to use string-functions.com for this sorta thing, but it's gone now.)
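A codepage guess like this can also be checked offline. The sketch below (Python 3; the helper name can_encode is made up for illustration) simply asks whether every character of the garbled name exists in a candidate codepage. IBM437 has no × or È, so it should be rejected, while the closely related cp850 accepts the whole string.

```python
# The garbled filename from the question.
garbled = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png"


def can_encode(text, codec):
    """Return True if every character of text exists in the given codec.

    Hypothetical helper: a codepage that cannot represent the garbled text
    cannot be the one that produced it.
    """
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


for codec in ("cp437", "cp850"):
    print(codec, can_encode(garbled, codec))
```

A codepage that passes this check is only a candidate; the resulting bytes still have to decode to something sensible in the suspected original encoding.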

Upvotes: -2

galinden

Reputation: 620

You could use chardet (install with pip). In Python 2, where a plain string literal is already a byte string:

import chardet

your_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"
detected_encoding = chardet.detect(your_str)["encoding"]

try:
    correct_str = your_str.decode(detected_encoding)
except UnicodeDecodeError:
    print("Could not estimate encoding")

Result: 時間試験観点（アニメパス）_10秒 (no idea if this could be correct or not)

For Python 3 (source file encoded as utf8):

import chardet
import codecs

falsely_decoded_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"

try:
    encoded_str = falsely_decoded_str.encode("cp850")
except UnicodeEncodeError:
    print("could not encode falsely decoded string")
    encoded_str = None

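if encoded_str:
    # chardet reports None as the encoding when it cannot make a guess
    detected_encoding = chardet.detect(encoded_str)["encoding"]

    try:
        correct_str = encoded_str.decode(detected_encoding)
    except (UnicodeDecodeError, TypeError):
        # TypeError covers detected_encoding being None
        print("could not decode encoded_str as %s" % detected_encoding)
    else:
        # write the result only when decoding succeeded
        with codecs.open("output.txt", "w", "utf-8-sig") as out:
            out.write(correct_str)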

In summary:

>>> s = 'Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png'
>>> s.encode('cp850').decode('shift-jis')
'時間試験観点（アニメパス）_10秒.png'
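If you don't already know which pair of encodings was involved, one option is to brute-force it: re-encode the garbled text with each codec that might have been used for the wrong decode, then decode the bytes with each codec the text might originally have been in, and keep the pairs that don't raise. This is only a sketch (Python 3). The function name unbake_candidates and the two candidate lists are my own assumptions, not anything from a library, and a pair that survives still has to be judged by eye.

```python
import itertools


def unbake_candidates(garbled, wrong_codecs, right_codecs):
    """Yield (wrong, right, text) for every codec pair that round-trips.

    wrong_codecs: guesses for the codec used in the mistaken decode.
    right_codecs: guesses for the encoding the bytes were really in.
    """
    for wrong, right in itertools.product(wrong_codecs, right_codecs):
        try:
            yield wrong, right, garbled.encode(wrong).decode(right)
        except (UnicodeEncodeError, UnicodeDecodeError):
            # This pair cannot explain the garbled text; try the next one.
            continue


s = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png"

# Illustrative guesses only; extend both lists for other languages.
wrong_guesses = ["cp437", "cp850", "latin-1", "cp1252"]
right_guesses = ["shift-jis", "euc-jp", "utf-8"]

results = list(unbake_candidates(s, wrong_guesses, right_guesses))
for wrong, right, text in results:
    print(wrong, right, text)
```

For this filename, the cp850 / shift-jis pair from the summary should be among the survivors. Strict error handling (the default) is what makes the filter useful: most wrong pairs hit a character or byte sequence they cannot represent and drop out.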

Upvotes: 6
