Reputation: 363304
When you have incorrectly decoded characters, how can you identify likely candidates for the original string?
Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png
I know for a fact that this image filename should have been some Japanese characters. But after various guesses at urllib quoting/unquoting and at encoding and decoding as ISO-8859-1 and UTF-8, I haven't been able to unmunge it and get the original filename.
Is the corruption reversible?
Upvotes: 4
Views: 4549
Reputation: 1
My first guess would've been that your gibberish was Shift JIS bytes mistakenly decoded as IBM437. Personally, I'd use this website here in the future (be sure to push the Swap button after encoding but before decoding). (I used to use string-functions.com for this sorta thing, but it's gone now.)
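That guess can be checked mechanically. Below is a minimal sketch, assuming Python 3 and its built-in `cp437`, `cp850` and `shift_jis` codecs; the helper name `try_repair` is made up for illustration. It re-encodes the garbled text with the suspected wrong code page to recover the raw bytes, then decodes those bytes as Shift JIS. For this particular filename, IBM437 cannot represent characters such as `×`, so the closely related code page 850 is tried as well.

```python
garbled = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png"


def try_repair(text, wrong_codec, right_codec="shift_jis"):
    """Undo a wrong decode: text -> original bytes -> intended text.

    Returns None when either step fails, meaning this codec pair
    cannot explain the garbled text. (Hypothetical helper.)
    """
    try:
        return text.encode(wrong_codec).decode(right_codec)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


# Compare the guessed code page with a close relative.
for codec in ("cp437", "cp850"):
    print(codec, "->", try_repair(garbled, codec))
```

A `None` result only rules out that specific pair of codecs; it does not mean the corruption is irreversible.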
Upvotes: -2
Reputation: 620
You could use chardet (install with pip):
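import chardet
# Python 2 only: a plain str is a byte string here, so it has .decode().
your_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"
# chardet guesses the encoding from the raw bytes.
detected_encoding = chardet.detect(your_str)["encoding"]
try:
    correct_str = your_str.decode(detected_encoding)
except UnicodeDecodeError:
    print("Could not estimate encoding")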
Result: 時間試験観点(アニメパス)_10秒 (no idea if this could be correct or not)
For Python 3 (source file encoded as utf8):
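import chardet
import codecs
# Text that was decoded with the wrong codec (cp850 in this case).
falsely_decoded_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"
try:
    # Undo the wrong decode to get the original bytes back.
    encoded_str = falsely_decoded_str.encode("cp850")
except UnicodeEncodeError:
    print("could not encode falsely decoded string")
    encoded_str = None
if encoded_str:
    # Let chardet guess the real encoding of the recovered bytes.
    detected_encoding = chardet.detect(encoded_str)["encoding"]
    try:
        correct_str = encoded_str.decode(detected_encoding)
    except (UnicodeDecodeError, TypeError):
        # TypeError covers chardet returning None as the encoding.
        print("could not decode encoded_str as %s" % detected_encoding)
    else:
        # Only write the file when decoding succeeded.
        with codecs.open("output.txt", "w", "utf-8-sig") as out:
            out.write(correct_str)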
In summary:
>>> s = 'Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png'
>>> s.encode('cp850').decode('shift-jis')
'時間試験観点(アニメパス)_10秒.png'
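When the two codecs are not known in advance, the same round trip can be brute-forced over a list of suspects. The following is a rough sketch, assuming Python 3; the function names, the codec lists and the scoring heuristic (share of characters whose Unicode name marks them as kanji or kana) are illustrative choices, not part of any library. It returns every pair that round-trips without error, best score first, so a human can inspect the candidates.

```python
import unicodedata


def japanese_ratio(text):
    """Fraction of characters whose Unicode name marks them as kanji or kana."""
    if not text:
        return 0.0
    hits = sum(
        1
        for ch in text
        if unicodedata.name(ch, "").startswith(("CJK UNIFIED", "HIRAGANA", "KATAKANA"))
    )
    return hits / len(text)


def candidate_repairs(garbled, wrong_codecs, right_codecs):
    """Try every (wrong, right) codec pair and rank the decodable results."""
    results = []
    for wrong in wrong_codecs:
        try:
            raw = garbled.encode(wrong)  # recover the original bytes
        except UnicodeEncodeError:
            continue  # this codec cannot have produced the garbled text
        for right in right_codecs:
            try:
                fixed = raw.decode(right)
            except UnicodeDecodeError:
                continue  # bytes are not valid in this encoding
            results.append((japanese_ratio(fixed), wrong, right, fixed))
    return sorted(results, reverse=True)


garbled = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png"
# Assumed suspect lists; extend them for other languages or platforms.
wrong_codecs = ["latin-1", "cp1252", "cp437", "cp850"]
right_codecs = ["utf-8", "shift_jis", "euc_jp", "iso2022_jp"]

for score, wrong, right, fixed in candidate_repairs(garbled, wrong_codecs, right_codecs):
    print(f"{score:.2f}  {wrong} -> {right}: {fixed}")
```

The score is only a ranking aid. Several codec pairs can decode without error and still produce nonsense, so the final call is a matter of reading the candidates.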
Upvotes: 6