Baz

Reputation: 13135

Reading srt (subtitle) files with Python3

I want to read an srt file with Python 3.

These files can be found here: http://www.opensubtitles.org/

With info here: http://en.wikipedia.org/wiki/SubRip

SubRip does not mandate an encoding: a file may be ASCII or one of the Unicode encodings, for example.

If I understand correctly, I need to specify which codec to use when I read the file with Python. Am I right in saying that I need to know how the file is encoded in order to choose one? If so, how do I establish that for each file when I have a hundred such files from different sources and in different languages?

Ultimately I would prefer to convert the files so that they are all UTF-8 to start with. But some of these files might be in some obscure encoding, for all I know.

Please help,

Barry

Upvotes: 3

Views: 4476

Answers (3)

gruentee

Reputation: 323

There's also a decent library for handling SRT files:

https://pypi.python.org/pypi/pysrt

You can specify the encoding when opening and writing SRT files.
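As a rough sketch of what that might look like, assuming pysrt is installed and that you already know (or have guessed) the source encoding. The helper name reencode_srt is made up for illustration; pysrt.open and the save method of the object it returns both accept an encoding argument.

```python
def reencode_srt(src, dst, source_encoding):
    """Read an SRT file in source_encoding and save a UTF-8 copy.

    Hypothetical helper for illustration; returns the number of subtitles.
    """
    # Third-party dependency (pip install pysrt), imported here so the
    # requirement is obvious at the point of use.
    import pysrt

    subs = pysrt.open(src, encoding=source_encoding)
    subs.save(dst, encoding="utf-8")
    return len(subs)
```

Something like reencode_srt("movie.srt", "movie.utf8.srt", "cp1252") would then write a UTF-8 copy next to the original, provided cp1252 really is the source encoding.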

Upvotes: 1

Thomas

Reputation: 181815

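You could use the chardet package to detect the encoding (charade was a Python 3 compatible fork of chardet that has since been merged back into it). Keep in mind that detection is a statistical guess, not a guarantee.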
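A minimal sketch of how detection might be wired up, assuming the chardet package is available. The function name guess_encoding and the fallback list are my own assumptions, not part of any library; chardet.detect takes bytes and returns a dict with 'encoding' and 'confidence' keys. If chardet is missing or cannot decide, the sketch simply tries a couple of common candidates.

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None

# Hypothetical fallback order, used only when chardet is unavailable or unsure.
FALLBACK_ENCODINGS = ("utf-8-sig", "cp1252")


def guess_encoding(raw):
    """Return a best-guess encoding name for raw subtitle bytes."""
    if chardet is not None:
        result = chardet.detect(raw)
        if result["encoding"]:
            return result["encoding"]
    for name in FALLBACK_ENCODINGS:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so decoding never fails,
    # but the resulting text may be wrong.
    return "latin-1"
```

You would read each file in binary mode, pass the bytes to guess_encoding, and then decode with the returned name. For short files or unusual encodings the guess can be wrong, so it is worth spot-checking the output.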

Upvotes: 2

brc

Reputation: 5391

You can check for a byte order mark (BOM) at the start of each .srt file to identify its encoding. However, this won't work for all files: a BOM is optional, and it only exists for the Unicode encodings (UTF-8, UTF-16, UTF-32) anyway. A check can be performed like this:

test_str = b'\xff\xfeOtherdata'

# Compare the leading bytes against the UTF-16 byte order marks.
if test_str.startswith(b'\xff\xfe'):
    print('UTF-16 Little Endian')
elif test_str.startswith(b'\xfe\xff'):
    print('UTF-16 Big Endian')
# ...

What you probably want to do is read the raw bytes from your file, decode them into a Unicode string (str in Python 3), work with that string until you are ready to write output, and then encode it back to bytes. See this talk for some more information, and code samples that might be relevant.
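The decode, process, encode round trip could be sketched like this for the "convert everything to UTF-8" goal in the question. It uses only the standard library; the function name convert_to_utf8 is hypothetical, and the source encoding is assumed to be known or guessed beforehand.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding):
    """Rewrite the file at src as UTF-8 at dst (hypothetical helper)."""
    raw = Path(src).read_bytes()          # bytes exactly as stored on disk
    text = raw.decode(source_encoding)    # bytes -> str; raises on a bad guess
    # Any processing of the subtitle text would happen here, on the str.
    Path(dst).write_bytes(text.encode("utf-8"))  # str -> UTF-8 bytes
```

Working on bytes for input and output keeps the original line endings intact. A UnicodeDecodeError here is a useful signal that the assumed encoding was wrong for that file.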

Upvotes: 1
