Penny
Penny

Reputation: 1280

Use Python xlsxwriter module to write srt data into and excel

this time I tried to use Python's xlsxwriter module to write data from a .srt into an excel.

The subtitle file looks like this in sublime text:

but I want to write the data into an excel, so it looks like this:

It's my first time to code python for this, so I'm still in the stage of trial and error...I tried to write some code like below

but I don't think it makes sense...

I'll continue trying out, but if you know how to do it, please let me know. I'll read your code and try to understand them! Thank you! :)

Upvotes: 1

Views: 1383

Answers (1)

Bryant
Bryant

Reputation: 622

The following breaks the problem into a few pieces:

  • Parsing the input file. parse_subtitles is a generator that takes a source of lines and yields up a sequence of records in the form {'index':'N', 'timestamp':'NN:NN:NN,NNN -> NN:NN:NN,NNN', 'subtitle':'TEXT'}'. The approach I took was to track which of three distinct states we're in:
    1. seeking to next entry for when we're looking for the next index number, which should match the regular expression ^\d*$ (nothing but a bunch of numbers)
    2. looking for timestamp when an index is found and we expect a timestamp to come in the next line, which should match the regular expression ^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$ (HH:MM:SS,mmm -> HH:MM:SS,mmm) and
    3. reading subtitles while consuming actual subtitle text, with blank lines and EOF interpreted as subtitle termination points.
  • Writing the above records to a row in a worksheet. write_dict_to_worksheet accepts a row and worksheet, plus a record and a dictionary defining the Excel 0-indexed column numbers for each of the record's keys, and then it writes the data appropriately.
  • Organizaing the overall conversion convert accepts an input filename (e.g. 'Wildlife.srt' that'll be opened and passed to the parse_subtitles function, and an output filename (e.g. 'Subtitle.xlsx' that will be created using xlsxwriter. It then writes a header and, for each record parsed from the input file, writes that record to the XLSX file.

Logging statements left in for self-commenting purposes, and because when reproducing your input file I fat-fingered a : to a ; in a timestamp, making it unrecognized, and having the error pop up was handy for debugging!

I've put a text version of your source file, along with the below code, in this Gist

import xlsxwriter
import re
import logging

def parse_subtitles(lines):
    line_index = re.compile('^\d*$')
    line_timestamp = re.compile('^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$')
    line_seperator = re.compile('^\s*$')

    current_record = {'index':None, 'timestamp':None, 'subtitles':[]}
    state = 'seeking to next entry'

    for line in lines:
        line = line.strip('\n')
        if state == 'seeking to next entry':
            if line_index.match(line):
                logging.debug('Found index: {i}'.format(i=line))
                current_record['index'] = line
                state = 'looking for timestamp'
            else:
                logging.error('HUH: Expected to find an index, but instead found: [{d}]'.format(d=line))

        elif state == 'looking for timestamp':
            if line_timestamp.match(line):
                logging.debug('Found timestamp: {t}'.format(t=line))
                current_record['timestamp'] = line
                state = 'reading subtitles'
            else:
                logging.error('HUH: Expected to find a timestamp, but instead found: [{d}]'.format(d=line))

        elif state == 'reading subtitles':
            if line_seperator.match(line):
                logging.info('Blank line reached, yielding record: {r}'.format(r=current_record))
                yield current_record
                state = 'seeking to next entry'
                current_record = {'index':None, 'timestamp':None, 'subtitles':[]}
            else:
                logging.debug('Appending to subtitle: {s}'.format(s=line))
                current_record['subtitles'].append(line)

        else:
            logging.error('HUH: Fell into an unknown state: `{s}`'.format(s=state))
    if state == 'reading subtitles':
        # We must have finished the file without encountering a blank line. Dump the last record
        yield current_record

def write_dict_to_worksheet(columns_for_keys, keyed_data, worksheet, row):
    """
    Write a subtitle-record to a worksheet. 
    Return the row number after those that were written (since this may write multiple rows)
    """
    current_row = row
    #First, horizontally write the entry and timecode
    for (colname, colindex) in columns_for_keys.items():
        if colname != 'subtitles': 
            worksheet.write(current_row, colindex, keyed_data[colname])

    #Next, vertically write the subtitle data
    subtitle_column = columns_for_keys['subtitles']
    for morelines in keyed_data['subtitles']:
        worksheet.write(current_row, subtitle_column, morelines)
        current_row+=1

    return current_row

def convert(input_filename, output_filename):
    workbook = xlsxwriter.Workbook(output_filename)
    worksheet = workbook.add_worksheet('subtitles')
    columns = {'index':0, 'timestamp':1, 'subtitles':2}

    next_available_row = 0
    records_processed = 0
    headings = {'index':"Entries", 'timestamp':"Timecodes", 'subtitles':["Subtitles"]}
    next_available_row=write_dict_to_worksheet(columns, headings, worksheet, next_available_row)

    with open(input_filename) as textfile:
        for record in parse_subtitles(textfile):
            next_available_row = write_dict_to_worksheet(columns, record, worksheet, next_available_row)
            records_processed += 1

    print('Done converting {inp} to {outp}. {n} subtitle entries found. {m} rows written'.format(inp=input_filename, outp=output_filename, n=records_processed, m=next_available_row))
    workbook.close()

convert(input_filename='Wildlife.srt', output_filename='Subtitle.xlsx')

Edit: Updated to split multiline subtitles across multiple rows in output

Upvotes: 3

Related Questions