Reputation: 13135
I want to be able to read an .srt file with Python 3.
These files can be found here: http://www.opensubtitles.org/
With info here: http://en.wikipedia.org/wiki/SubRip
SubRip supports any encoding: ASCII or Unicode, for example.
If I understand correctly, I need to specify which decoder to use when I call Python's read function. So am I right in saying that I need to know how each file is encoded in order to make that choice? If so, how do I establish the encoding for each file if I have a hundred such files from different sources and in different languages?
Ultimately I would prefer to convert the files so that they are all UTF-8 encoded to start with. But some of these files might be in some obscure encoding for all I know.
Please help,
Barry
Upvotes: 3
Views: 4476
Reputation: 323
There's also a decent library for handling SRT files:
https://pypi.python.org/pypi/pysrt
You can specify the encoding when opening and writing SRT files.
Upvotes: 1
Reputation: 181815
You could use the charade package (formerly chardet) to detect the encoding.
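A minimal sketch of what that detection might look like, assuming the third-party chardet package is installed (as far as I know, charade exposed the same `detect` function). The helper name `detect_encoding` is made up for illustration, and the result is only a statistical guess, so treat it with some suspicion on short files.

```python
try:
    import chardet  # third-party; assumed installed, e.g. via `pip install chardet`
except ImportError:
    chardet = None  # the sketch degrades to "no guess" when the package is missing


def detect_encoding(data):
    """Return the detector's best-guess codec name for raw bytes, or None.

    `detect_encoding` is a hypothetical helper, not part of chardet itself.
    """
    if chardet is None:
        return None
    # chardet.detect() takes bytes and returns a dict that includes
    # the keys 'encoding' and 'confidence'; 'encoding' may be None.
    result = chardet.detect(data)
    return result["encoding"]
```

You would read each file in binary mode (`open(path, 'rb').read()`) and pass the bytes to the helper, then decode with whatever name it returns.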
Upvotes: 2
Reputation: 5391
You can check for the byte order mark (BOM) at the start of each .srt file to test for encoding. However, this probably won't work for all files, since a BOM is optional and only appears in UTF-encoded files anyway. A check can be performed like this:
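testStr = b'\xff\xfeOtherdata'  # sample bytes that start with a UTF-16 LE BOM
if testStr[0:2] == b'\xff\xfe':
    print('UTF-16 Little Endian')
elif testStr[0:2] == b'\xfe\xff':
    print('UTF-16 Big Endian')
# ... (similar checks for other BOMs, e.g. b'\xef\xbb\xbf' for UTF-8)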
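For what it's worth, the same check can be written against the BOM constants in the standard `codecs` module, which avoids typing the byte values by hand. This is only a sketch: `sniff_bom` and `_BOMS` are names invented here, and a file without a BOM simply yields `None`.

```python
import codecs

# Check the 4-byte UTF-32 BOMs first: the UTF-32 LE BOM (ff fe 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (ff fe).
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(data):
    """Return a codec name if `data` starts with a known BOM, else None.

    The returned codecs ("utf-8-sig", "utf-16", "utf-32") consume the BOM
    themselves when decoding, so it does not end up in the text.
    """
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name
    return None
```

Only the first four bytes of the file are needed, so `sniff_bom(open(path, 'rb').read(4))` is enough.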
What you probably want to do is simply open your file, then decode whatever you pull out of the file into unicode, deal with the unicode representation until you are ready to print, and then encode it back again. See this talk for some more information, and code samples that might be relevant.
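As a rough illustration of that decode, work, encode cycle, here is a sketch that rewrites one file as UTF-8. It assumes you already know (or have guessed) the source encoding; `convert_to_utf8` and its parameters are hypothetical names, not part of any library.

```python
from pathlib import Path


def convert_to_utf8(src_path, dst_path, source_encoding):
    """Read src_path as bytes, decode it, and write it back out as UTF-8.

    Raises UnicodeDecodeError if `source_encoding` does not fit the data,
    which is a useful signal that the guess was wrong.
    """
    raw = Path(src_path).read_bytes()      # bytes exactly as stored on disk
    text = raw.decode(source_encoding)     # now a str (Unicode) to work with
    # ... any processing of `text` would happen here ...
    Path(dst_path).write_bytes(text.encode("utf-8"))  # encode only on the way out
    return text
```

For example, `convert_to_utf8('movie.srt', 'movie.utf8.srt', 'cp1252')` would convert a Windows-1252 file; the file names here are placeholders. Reading and writing bytes, instead of opening in text mode, keeps the line endings untouched.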
Upvotes: 1