Python aligning audio transcription to script for subtitles using word similarity

Question

I'm attempting to align a string of text transcribed by a fork of OpenAI's Whisper model, WhisperX, to the original script being read off to automate generating accurate subtitles. WhisperX provides word level timestamps of each word that would be useful to have connected to the original text for displaying accurate subtitles. I have the timings of each of the transcribed words, it's just mapping them onto the script words so that I can show the subtitles at the right times now. Obviously the transcribed text produced will not match the original script one to one because of unique spellings, different punctuation, compound words, etc.

Here is an example of the two array inputs and the desired array output.

transcription = ['The', 'hive', 'mind' 'dips', 'and', 'dabbles,', 'care', 'free', 'diving', 'for', 'la', 'la', 'le', 'loo, 'Brian,', 'bobbing', 'heads', 'and', 'bottoms', 'rolls,', 'webbed', 'feet', 'and', 'orange', 'beaks.']

script = ['The', 'hivemind', 'dips', 'and', 'dabbles,', 'carefree', 'Diving', 'for', 'lalaleeloo', 'Bryan.', 'Bobbing', 'heads', 'and', 'bottoms', 'rolls,', 'Webbed-feet', 'and', 'orange', 'beaks.']

indexes = [1, 2, 2, 3, 4, 5, 6, 6, 7, 8, 9, 9, 9, 9, 10, 11, 12, 13, 14, 15, 16, 16, 17, 18, 19]

I've tried using jellyfish and SequenceMatcher. I've also attempted plugging everything into a chatgpt prompt which usually returns the best result but is unstable and I think a algorithm is the best approach. I've tried using a combination of Jaro Distance, basic letter order sharing between each string and the relative distance between where the last word aligned was all as weights to determine a score of which script word to align the transcription to but I haven't got that right. Here is a current version.

`def do(num, line): segments = transcribe.do(num)['segments']

transcribed_lines = []
for segment in segments:
    for word in segment['words']:
        transcribed_lines.append(word['text'])

print(transcribed_lines)
print(line.split(' '))

def find_best_match(word, word_list):
    best_match = None
    highest_ratio = 0
    for i, w in enumerate(word_list):
        ratio = SequenceMatcher(None, word, w).ratio()
        if ratio > highest_ratio:
            best_match = i
            highest_ratio = ratio
    return best_match

alignment = []
for word in transcribed_lines:
    match_index = find_best_match(word, line.split(' '))
    alignment.append(match_index)

onword = 0
line_transcribed = []
for segment in segments:
    for word in segment['words']:
        if onword == 0 or alignment[onword] != alignment[onword-1]:
            print(line.split(' ')[onword])
            line_transcribed.append({'text': line.split(' ')[onword], 'start': word['start'], 'end': word['end']})
        onword+=1

print(line_transcribed)
return(line_transcribed)`

Python aligning audio transcription to script for subtitles using word similarity

Answers (0)

Related Questions