Reputation: 1280
this time I tried to use Python's xlsxwriter module to write data from a .srt into an excel.
The subtitle file looks like this in sublime text:
but I want to write the data into an excel, so it looks like this:
It's my first time to code python for this, so I'm still in the stage of trial and error...I tried to write some code like below
but I don't think it makes sense...
I'll continue trying out, but if you know how to do it, please let me know. I'll read your code and try to understand them! Thank you! :)
Upvotes: 1
Views: 1383
Reputation: 622
The following breaks the problem into a few pieces:
parse_subtitles
is a generator that takes a source of lines and yields up a sequence of records in the form {'index':'N', 'timestamp':'NN:NN:NN,NNN -> NN:NN:NN,NNN', 'subtitle':'TEXT'}'
. The approach I took was to track which of three distinct states we're in:
seeking to next entry
for when we're looking for the next index number, which should match the regular expression ^\d*$
(nothing but a bunch of numbers)looking for timestamp
when an index is found and we expect a timestamp to come in the next line, which should match the regular expression ^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$
(HH:MM:SS,mmm -> HH:MM:SS,mmm) and reading subtitles
while consuming actual subtitle text, with blank lines and EOF interpreted as subtitle termination points.write_dict_to_worksheet
accepts a row and worksheet, plus a record and a dictionary defining the Excel 0-indexed column numbers for each of the record's keys, and then it writes the data appropriately.convert
accepts an input filename (e.g. 'Wildlife.srt'
that'll be opened and passed to the parse_subtitles
function, and an output filename (e.g. 'Subtitle.xlsx'
that will be created using xlsxwriter
. It then writes a header and, for each record parsed from the input file, writes that record to the XLSX file.Logging statements left in for self-commenting purposes, and because when reproducing your input file I fat-fingered a :
to a ;
in a timestamp, making it unrecognized, and having the error pop up was handy for debugging!
I've put a text version of your source file, along with the below code, in this Gist
import xlsxwriter
import re
import logging
def parse_subtitles(lines):
line_index = re.compile('^\d*$')
line_timestamp = re.compile('^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$')
line_seperator = re.compile('^\s*$')
current_record = {'index':None, 'timestamp':None, 'subtitles':[]}
state = 'seeking to next entry'
for line in lines:
line = line.strip('\n')
if state == 'seeking to next entry':
if line_index.match(line):
logging.debug('Found index: {i}'.format(i=line))
current_record['index'] = line
state = 'looking for timestamp'
else:
logging.error('HUH: Expected to find an index, but instead found: [{d}]'.format(d=line))
elif state == 'looking for timestamp':
if line_timestamp.match(line):
logging.debug('Found timestamp: {t}'.format(t=line))
current_record['timestamp'] = line
state = 'reading subtitles'
else:
logging.error('HUH: Expected to find a timestamp, but instead found: [{d}]'.format(d=line))
elif state == 'reading subtitles':
if line_seperator.match(line):
logging.info('Blank line reached, yielding record: {r}'.format(r=current_record))
yield current_record
state = 'seeking to next entry'
current_record = {'index':None, 'timestamp':None, 'subtitles':[]}
else:
logging.debug('Appending to subtitle: {s}'.format(s=line))
current_record['subtitles'].append(line)
else:
logging.error('HUH: Fell into an unknown state: `{s}`'.format(s=state))
if state == 'reading subtitles':
# We must have finished the file without encountering a blank line. Dump the last record
yield current_record
def write_dict_to_worksheet(columns_for_keys, keyed_data, worksheet, row):
"""
Write a subtitle-record to a worksheet.
Return the row number after those that were written (since this may write multiple rows)
"""
current_row = row
#First, horizontally write the entry and timecode
for (colname, colindex) in columns_for_keys.items():
if colname != 'subtitles':
worksheet.write(current_row, colindex, keyed_data[colname])
#Next, vertically write the subtitle data
subtitle_column = columns_for_keys['subtitles']
for morelines in keyed_data['subtitles']:
worksheet.write(current_row, subtitle_column, morelines)
current_row+=1
return current_row
def convert(input_filename, output_filename):
workbook = xlsxwriter.Workbook(output_filename)
worksheet = workbook.add_worksheet('subtitles')
columns = {'index':0, 'timestamp':1, 'subtitles':2}
next_available_row = 0
records_processed = 0
headings = {'index':"Entries", 'timestamp':"Timecodes", 'subtitles':["Subtitles"]}
next_available_row=write_dict_to_worksheet(columns, headings, worksheet, next_available_row)
with open(input_filename) as textfile:
for record in parse_subtitles(textfile):
next_available_row = write_dict_to_worksheet(columns, record, worksheet, next_available_row)
records_processed += 1
print('Done converting {inp} to {outp}. {n} subtitle entries found. {m} rows written'.format(inp=input_filename, outp=output_filename, n=records_processed, m=next_available_row))
workbook.close()
convert(input_filename='Wildlife.srt', output_filename='Subtitle.xlsx')
Edit: Updated to split multiline subtitles across multiple rows in output
Upvotes: 3