Reputation: 602
I recently downloaded a pack of videos that should have Japanese characters in their file names. Instead, whoever uploaded them botched the encoding.
Instead of hiragana, katakana, and kanji, I get:
002òÅü¢âyâbâeâBâôâO(âuâïâ}).mp4
Is there a way to fix this, short of asking for another upload?
I tried putting the names into a text file and hex editing that file to change its encoding, but that didn't work.
Upvotes: 0
Views: 999
Reputation: 177604
I would use the chardet library for Python as an aid to guess at the encoding.
>>> import chardet
>>> s='002òÅü¢âyâbâeâBâôâO(âuâïâ}).mp4'
>>> chardet.detect(s.encode('l1'))
{'encoding': 'ISO-8859-5', 'confidence': 0.536359806931924, 'language': 'Russian'}
>>> chardet.detect(s.encode('cp437'))
{'encoding': 'SHIFT_JIS', 'confidence': 0.99, 'language': 'Japanese'}
>>> chardet.detect(s.encode('cp850'))
{'encoding': 'SHIFT_JIS', 'confidence': 0.99, 'language': 'Japanese'}
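So the names were probably not mis-decoded as ISO-8859-1 (the 'l1' guess has low confidence). More likely the original Shift JIS bytes were displayed as IBM code page 437 or 850.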
>>> s.encode('cp850').decode('sjis')
'002撫⊃ペッティング(ブルマ).mp4'
>>> s.encode('cp437').decode('sjis')
'002撫○ペッティング(ブルマ).mp4'
The two results differ only right after the first kanji. Either one could be right, but I can't read Japanese, so I can't tell which.
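Once you know which code page gives the right names, the same encode/decode round trip could be applied to every file in the folder. Here is a minimal sketch, not a tested tool: `fix_name`, `plan_renames`, and `rename_all` are names made up for illustration, and the default of `cp437` is only a guess (swap in `cp850` if that one reads correctly). It defaults to a dry run that only prints what it would do.

```python
from pathlib import Path


def fix_name(name, wrong="cp437", right="shift_jis"):
    """Undo a mis-decoding: recover the original bytes, then decode them properly."""
    return name.encode(wrong).decode(right)


def plan_renames(names, wrong="cp437", right="shift_jis"):
    """Return (old, new) pairs for names that can be repaired and actually change."""
    plan = []
    for old in names:
        try:
            new = fix_name(old, wrong, right)
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not mojibake of this kind (or the guess is wrong); leave it alone.
            continue
        if new != old:
            plan.append((old, new))
    return plan


def rename_all(folder, wrong="cp437", right="shift_jis", dry_run=True):
    """Print the planned renames for one folder; rename only if dry_run is False."""
    folder = Path(folder)
    names = [p.name for p in folder.iterdir() if p.is_file()]
    for old, new in plan_renames(names, wrong, right):
        print(f"{old} -> {new}")
        if not dry_run:
            (folder / old).rename(folder / new)
```

Call `rename_all("path/to/videos")` first and check the printed names, then run it again with `dry_run=False` once they look right. Names that are plain ASCII or that don't survive the round trip are skipped rather than renamed.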
Upvotes: 2