Reputation: 381
I have a speech-to-text app and I'm wandering a bit in the dark about how to efficiently handle the response and organize it into a transcription. I feed the transcriber function 45-second chunks like this: all_text = pool.map(transcribe, enumerate(files)). This is the response I get:
all text: [{'idx': 0, 'text': ['users outnumber', ' future'], 'participant': 'str_MIC_Ct3G_con_O6qn4m00bs', 'file_index': 0, 'words': [{'word': 'users', 'start_time': 0, 'participant': 'str_MIC_Ct3G_con_O6qn4m00bs'}, {'word': 'outnumber', 'start_time': 0, 'participant': 'str_MIC_Ct3G_con_O6qn4m00bs'}, {'word': 'future', 'start_time': 4, 'participant': 'str_MIC_Ct3G_con_O6qn4m00bs'}]},
{'idx': 1, 'text': ['and the sustainable energy'], 'participant': 'str_MIC_Ct3G_con_O6qn4m00bs', 'file_index': 1, 'words': [{'word': 'and', 'start_time': 45, 'participant': 'str_MIC_Ct3G_con_O6qn4m00bs'}, {'word': 'the', 'start_time': 45, 'participant': 'str_MIC_Ct3G_con_O6qn4m00bs'}, {'word': 'sustainable', 'start_time': 45, 'participant': 'str_MIC_Ct3G_con_O6qn4m00bs'}, {'word': 'energy', 'start_time': 52, 'participant': 'str_MIC_Ct3G_con_O6qn4m00bs'}]}]
So here I had two 45-second chunks from Elon Musk's speech. I cut most of the response to make it shorter, but as you can see, there are two chunks, with indexes 0 and 1. How can I build the transcription from this response based on each word's start_time value? Here I took only seconds, but of course I can get nanos as well. Is it OK to make another list, push all the words into it, and then sort the list by start_time? That brings me to my second question: how efficient is this? If I end up with a mile-long list of words and other info from multiple users, are there likely to be issues? Is there a better way of doing this?
EDIT. This is what I tried. It works with short sessions, but the app crashes with longer ones. I wonder if it has something to do with the list getting too big?
words = []
clean_transcript = ''
for word in alternative.words:
    words.append({'word': word.word, 'start_time': word.start_time.seconds, 'participant': participant})
words.sort(key=lambda x: x['start_time'])
print('ALL WORDS: ', words)
for w in words:
    clean_transcript += w['word'] + ' '
print(clean_transcript)
Is there some obvious "don't do it like this"?
Upvotes: 0
Views: 58
Reputation: 142641
First, you should try to use a normal for-loop, or rather nested for-loops.
text = [
{'idx': 0, 'text': ['users outnumber', ' future'], 'participant': 'str_MIC_Ct3G_con_O6qn4m00bs', 'file_index': 0, 'words': [{'word': 'users', 'start_time': 0, 'participant': 'str_MIC_Ct3G_con_O6qn4m00bs'}, {'word': 'outnumber', 'start_time': 0, 'participant': 'str_MIC_Ct3G_con_O6qn4m00bs'}, {'word': 'future', 'start_time': 4, 'participant': 'str_MIC_Ct3G_con_O6qn4m00bs'}]},
{'idx': 1, 'text': ['and the sustainable energy'], 'participant': 'str_MIC_Ct3G_con_O6qn4m00bs', 'file_index': 1, 'words': [{'word': 'and', 'start_time': 45, 'participant': 'str_MIC_Ct3G_con_O6qn4m00bs'}, {'word': 'the', 'start_time': 45, 'participant': 'str_MIC_Ct3G_con_O6qn4m00bs'}, {'word': 'sustainable', 'start_time': 45, 'participant': 'str_MIC_Ct3G_con_O6qn4m00bs'}, {'word': 'energy', 'start_time': 52, 'participant': 'str_MIC_Ct3G_con_O6qn4m00bs'}]}
]
for item in text:
    print('---', item['idx'], '---')
    for word in item['words']:
        if word['start_time'] >= 45:
            print(word['start_time'], word['word'])
Result:
--- 0 ---
--- 1 ---
45 and
45 the
45 sustainable
52 energy
Later, you can try to convert it to list comprehensions:
result = [[(word['start_time'], word['word']) for word in item['words'] if word['start_time'] >= 45] for item in text]
print(result)
Result
[[], [(45, 'and'), (45, 'the'), (45, 'sustainable'), (52, 'energy')]]
Or without start_time
result = [[word['word'] for word in item['words'] if word['start_time'] >= 45] for item in text]
print(result)
Result
[[], ['and', 'the', 'sustainable', 'energy']]
Or, if you want to create a flat list instead of sublists:
result = [word['word'] for item in text for word in item['words'] if word['start_time'] >= 45]
print(result)
Result
['and', 'the', 'sustainable', 'energy']
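To get the transcript itself, the same flat comprehension can be combined with a sort on start_time and a single str.join. This is only a minimal sketch, assuming the response has the structure shown in the question; build_transcript is a hypothetical helper name, and the sample data is trimmed to the fields used here, with the chunks deliberately given out of order. Python's sort is stable, so words that share the same start_time (likely when only whole seconds are kept) stay in the order they were listed in.

```python
from operator import itemgetter

# Sample data shaped like the response in the question, trimmed to the
# fields used here. The chunks are deliberately listed out of order.
all_text = [
    {'idx': 1, 'words': [
        {'word': 'and', 'start_time': 45},
        {'word': 'the', 'start_time': 45},
        {'word': 'sustainable', 'start_time': 45},
        {'word': 'energy', 'start_time': 52},
    ]},
    {'idx': 0, 'words': [
        {'word': 'users', 'start_time': 0},
        {'word': 'outnumber', 'start_time': 0},
        {'word': 'future', 'start_time': 4},
    ]},
]


def build_transcript(chunks):
    """Hypothetical helper: flatten all words, sort by start_time, join."""
    # Flatten every chunk's word list into one list.
    words = [word for item in chunks for word in item['words']]
    # sorted() is stable: words with equal start_time keep their listed order.
    words = sorted(words, key=itemgetter('start_time'))
    # One join instead of repeated string concatenation in a loop.
    return ' '.join(w['word'] for w in words)


transcript = build_transcript(all_text)
print(transcript)
```

Sorting is O(n log n), so a list of many thousands of small dicts should not be a problem on its own. If long sessions still crash, the cause may lie elsewhere.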
Upvotes: 1