Kayot

Reputation: 602

Fix Filename that was changed to ASCII from UTF8

I recently downloaded a pack of videos that should have Japanese characters in their file names. Instead, whoever uploaded them botched the encoding.

Instead of kana and kanji, I get:

002òÅü¢âyâbâeâBâôâO(âuâïâ}).mp4

I was wondering if there was a way to fix this short of asking for another upload?

I tried putting the names into a text file and then hex editing that file to change its encoding, but that didn't work.

Upvotes: 0

Views: 999

Answers (1)

Josh Lee

Reputation: 177604

I would use the chardet library for Python as an aid to guess at the encoding.

>>> import chardet
>>> s='002òÅü¢âyâbâeâBâôâO(âuâïâ}).mp4'
>>> chardet.detect(s.encode('l1'))
{'encoding': 'ISO-8859-5', 'confidence': 0.536359806931924, 'language': 'Russian'}
>>> chardet.detect(s.encode('cp437'))
{'encoding': 'SHIFT_JIS', 'confidence': 0.99, 'language': 'Japanese'}
>>> chardet.detect(s.encode('cp850'))
{'encoding': 'SHIFT_JIS', 'confidence': 0.99, 'language': 'Japanese'}

So the Shift JIS bytes were probably not misread as ISO-8859-1; IBM code page 437 or 850 is more likely.

>>> s.encode('cp850').decode('sjis')
'002撫⊃ペッティング(ブルマ).mp4'
>>> s.encode('cp437').decode('sjis')
'002撫○ペッティング(ブルマ).mp4'

Could be either one of these, but I can't read them.
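
If one of those round trips looks right, the same encode/decode step could be applied to every file in the folder. This is only a rough sketch: it assumes cp437 is the code page the names were misread as (swap in 'cp850' if that turns out to be correct), and fix_name and rename_all are made-up helper names, not part of any library.

```python
import os

def fix_name(name, wrong='cp437', right='shift_jis'):
    """Undo the mojibake: recover the raw bytes via the wrong code page,
    then decode them as Shift JIS. Both code pages are assumptions."""
    try:
        return name.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The name does not round-trip, so it is probably not this
        # kind of mojibake; leave it unchanged.
        return name

def rename_all(folder, dry_run=True):
    """Print the proposed rename for each entry in folder.
    Files are only renamed when dry_run is False."""
    for old in os.listdir(folder):
        new = fix_name(old)
        if new != old:
            print(f'{old!r} -> {new!r}')
            if not dry_run:
                os.rename(os.path.join(folder, old),
                          os.path.join(folder, new))
```

Running rename_all on the folder with the default dry_run=True first lets you check the proposed names before anything on disk is touched.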

Upvotes: 2
