Reputation: 65
I am downloading Serbian/Bosnian/Croatian subtitles via VLC player on an Ubuntu machine, and I constantly have to change characters such as æ, è, and ð into ć, č, and đ by hand so that the player can render them properly. I wanted to write a Python 3 function that does this for me, but I got lost trying to understand string encoding and decoding.
Using chardet.detect, I found that the .srt files that VLC player downloads are encoded as Windows-1252. So right now, I do something like this:
import codecs
f = codecs.open('my_file.srt', 'r', encoding='Windows-1252')
data = f.read()
data_utf8 = data.encode('utf-8')
f.close()
The thing is, when I print the content of the data variable to the terminal, I get a fragment like this: obožavam vaše.
But when I print the content of the data_utf8 variable, that same fragment looks like this: obo\xc5\xbeavam va\xc5\xa1e.
This is not what I expected.
Furthermore, when I try to save this data to a file:
with open('my_utf8_file.srt', 'w') as f:
    f.write(data_utf8)
I get TypeError: write() argument must be str, not bytes.
Can anyone tell me what I am doing wrong?
Upvotes: 1
Views: 2344
Reputation: 8010
You have to use:
with open('my_utf8_file.srt', 'wb') as f:
    f.write(data_utf8)
Note the 'b' in the mode: it opens the file in binary mode, so you can write bytes, which is what .encode() returns. This is also why data_utf8 prints differently: printing a bytes object shows escaped byte values instead of decoded text.
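To make the str/bytes distinction concrete, here is a minimal sketch using the fragment from the question. The variable names are only illustrative.

```python
# A str holds Unicode characters; encode() turns it into raw bytes.
text = "obožavam vaše"
encoded = text.encode("utf-8")

print(type(text).__name__)     # str
print(type(encoded).__name__)  # bytes
print(encoded)                 # b'obo\xc5\xbeavam va\xc5\xa1e'

# Decoding the bytes with the same codec gives the original text back.
round_trip = encoded.decode("utf-8")
print(round_trip == text)      # True
```

The escaped output is not corrupted data. It is just how Python displays a bytes object, and those bytes are the correct UTF-8 encoding of the text.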
Alternatively, you can do something like:
with open('my_utf8_file.srt', 'w', encoding='utf-8') as f:
    f.write(data)
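Putting the read and write steps together, a conversion helper might look like the sketch below. The function name convert_to_utf8 and its parameters are hypothetical, and the default source encoding is simply the one reported in the question, so it may need to be changed for your files.

```python
def convert_to_utf8(src, dst, source_encoding="windows-1252"):
    """Re-encode the text file at src as UTF-8 and write it to dst.

    source_encoding is an assumption taken from the question; pass a
    different codec name if the file turns out to use another encoding.
    """
    # newline="" keeps the original line endings (e.g. \r\n) untouched.
    with open(src, "r", encoding=source_encoding, newline="") as f:
        text = f.read()  # bytes on disk -> str
    # Text mode with encoding="utf-8": Python encodes the str on write.
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return text


# Example call (file names as in the question):
# convert_to_utf8("my_file.srt", "my_utf8_file.srt")
```

Because both files are opened in text mode with an explicit encoding, no manual .encode() call is needed and the TypeError from the question does not occur.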
Upvotes: 1
Reputation: 98931
Try using chardet to determine the correct file encoding.
Open a command line and type:
> chardetect Joker.BDRip.x264-AAA.cyr.srt
Joker.BDRip.x264-AAA.cyr.srt: windows-1251 with confidence 0.8720288218241439
> chardetect Joker.BDRip.x264-AAA.cyr_utf8.srt
Joker.BDRip.x264-AAA.cyr_utf8.srt: utf-8 with confidence 0.99
Install:
pip install chardet
I used these Serbian subtitles to test.
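Why the detected encoding matters can be seen with the characters from the question. The sketch below is one possible reading of the symptoms, not a diagnosis of any particular file: it assumes the subtitles were actually saved as Windows-1250 (the Central European code page) and then decoded as Windows-1252.

```python
# The three byte values behind the characters mentioned in the question.
raw = bytes([0xE6, 0xE8, 0xF0])

as_cp1252 = raw.decode("cp1252")  # Western European code page
as_cp1250 = raw.decode("cp1250")  # Central European code page

print(as_cp1252)  # æèð
print(as_cp1250)  # ćčđ
```

The same bytes give æ, è, ð under one code page and ć, č, đ under the other, so if a file behaves this way, decoding it with cp1250 instead of cp1252 may remove the need to replace characters by hand.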
Upvotes: 0