MarkoF

Reputation: 65

Convert Windows-1252 subtitle file to utf-8

I download Serbian/Bosnian/Croatian subtitles via VLC player on an Ubuntu machine, and I constantly have to change characters such as æ, è, and ð into ć, č, and đ by hand so that the player renders them properly. I wanted to write a python3 function that does this for me, but I got lost trying to understand string encoding and decoding.

Using chardet.detect, I found that the .srt files VLC player downloads are detected as Windows-1252. So right now, I do something like this:

import codecs

f = codecs.open('my_file.srt', 'r', encoding='Windows-1252')
data = f.read()
data_utf8 = data.encode('utf-8')
f.close()

The thing is, when I print the content of the data variable to the terminal, I might get a fragment like this: obožavam vaše. But when I print the content of the data_utf8 variable, that same fragment looks like this: obo\xc5\xbeavam va\xc5\xa1e. This is not what I expected.

Furthermore, when I now want to save this data to a file

with open('my_utf8_file.srt', 'w') as f:
    f.write(data_utf8)

I get TypeError: write() argument must be str, not bytes.

Can anyone tell me what I am doing wrong?

Upvotes: 1

Views: 2344

Answers (2)

mousetail

Reputation: 8010

You have to use:

with open('my_utf8_file.srt', 'wb') as f:
    f.write(data_utf8)

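Note the 'b': it opens the file in binary mode, so you can write bytes (which is what .encode() returns). This is also why data_utf8 prints differently: printing a bytes object shows escape sequences such as \xc5\xbe rather than the decoded characters.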

Alternatively, you can do something like:

with open('my_utf8_file.srt', 'w', encoding='utf-8') as f:
    f.write(data)
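
Putting the two steps together, a minimal sketch of a reusable converter could look like the following. The function names recode_to_utf8 and convert_file are made up for illustration, and the source encoding is passed in as a parameter, since it is an assumption about the input file rather than something the code can verify.

```python
def recode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw bytes from source_encoding and re-encode them as UTF-8."""
    text = raw.decode(source_encoding)  # bytes -> str
    return text.encode("utf-8")         # str -> bytes


def convert_file(src_path: str, dst_path: str, source_encoding: str) -> None:
    """Rewrite the file at src_path as UTF-8 at dst_path (hypothetical helper)."""
    # Binary mode on both sides: no implicit decoding, line endings untouched.
    with open(src_path, "rb") as src:
        raw = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(recode_to_utf8(raw, source_encoding))
```

With the file names from the question, the call would be convert_file('my_file.srt', 'my_utf8_file.srt', 'cp1252'). Working on bytes throughout avoids mixing str and bytes, which is what triggered the TypeError.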

Upvotes: 1

Pedro Lobito

Reputation: 98931

Try using chardet to determine the correct file encoding.
Open a command line and type:

> chardetect Joker.BDRip.x264-AAA.cyr.srt
Joker.BDRip.x264-AAA.cyr.srt: windows-1251 with confidence 0.8720288218241439

> chardetect Joker.BDRip.x264-AAA.cyr_utf8.srt
Joker.BDRip.x264-AAA.cyr_utf8.srt: utf-8 with confidence 0.99

Install:

pip install chardet

I used these Serbian subtitles to test.
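
Detectors only guess, so it can help to decode a few suspicious bytes under each candidate encoding and see which result reads correctly. The sketch below uses only the standard library; the helper name decode_with_candidates and the choice of cp1250 as the second candidate are assumptions for illustration. It takes the bytes that display as æ, è, ð under Windows-1252 and decodes them as Windows-1250 as well.

```python
def decode_with_candidates(raw, candidates=("cp1252", "cp1250")):
    """Decode the same bytes under several encodings for side-by-side comparison."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None  # these bytes are not valid in this encoding
    return results


# The bytes that show up as æ, è, ð when read as Windows-1252.
raw = "æèð".encode("cp1252")

for name, text in decode_with_candidates(raw).items():
    print(name, "->", text)
# cp1252 -> æèð
# cp1250 -> ćčđ
```

The same bytes read as ć, č, đ under Windows-1250 (Central European), which suggests, though it does not prove, that the files in the question are Windows-1250 rather than Windows-1252. If so, opening them with encoding='cp1250' would make the manual character replacement unnecessary.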

Upvotes: 0
