I am using Whisper and need to provide accurate results to my end users. The two options are segment-level output (first) and word-level timestamps (second):
{ "id": 0, "seek": 3000, "start": 30.0, "end": 33.0, "text": " Mama, take this badge off of me", "tokens": [ 50364, 17775, 11, 747, 341, 25797, 766, 295, 385, 50514 ], "temperature": 0.0, "avg_logprob": -0.29641300439834595, "compression_ratio": 1.2115384340286255, "no_speech_prob": 0.31771889328956604 },
{ "id": 1, "seek": 3000, "start": 37.0, "end": 40.0, "text": " I can't use it anymore", "tokens": [ 50714, 286, 393, 380, 764, 309, 3602, 50864 ], "temperature": 0.0, "avg_logprob": -0.29641300439834595, "compression_ratio": 1.2115384340286255, "no_speech_prob": 0.31771889328956604 },
{ "word": "Mama", "start": 30.0, "end": 30.639999389648438 },
{ "word": "take", "start": 30.920000076293945, "end": 30.920000076293945 },
{ "word": "this", "start": 30.920000076293945, "end": 31.360000610351562 },
{ "word": "badge", "start": 31.360000610351562, "end": 31.81999969482422 },
{ "word": "off", "start": 31.81999969482422, "end": 32.20000076293945 },
{ "word": "of", "start": 32.20000076293945, "end": 32.439998626708984 },
{ "word": "me", "start": 32.439998626708984, "end": 33.63999938964844 },
{ "word": "I", "start": 33.63999938964844, "end": 37.380001068115234 },
{ "word": "can't", "start": 37.380001068115234, "end": 37.81999969482422 },
{ "word": "use", "start": 37.81999969482422, "end": 38.279998779296875 },
{ "word": "it", "start": 38.279998779296875, "end": 38.939998626708984 },
{ "word": "anymore", "start": 38.939998626708984, "end": 40.459999084472656 },
Am I missing something in the word-level timestamp output that would let me reconstruct the sentences as Whisper heard them? Can you think of another way?
I would like to avoid making the API call twice, for financial reasons (and to save the planet ^^).
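To make the question concrete, here is a minimal sketch of the kind of reconstruction I have in mind, using only the word-level output. It is a heuristic, not something Whisper provides: in the sample above the silence between the two lines appears to be absorbed into the word "I" (33.64 to 37.38), so the sketch starts a new line whenever a word's own duration, or the gap before it, exceeds a threshold. The function name `group_words` and the 2-second `max_pause` default are my own assumptions, and the sample timestamps are rounded from the output above.

```python
from typing import Dict, List

# Word-level output from the post, with timestamps rounded for readability.
WORDS: List[Dict] = [
    {"word": "Mama", "start": 30.0, "end": 30.64},
    {"word": "take", "start": 30.92, "end": 30.92},
    {"word": "this", "start": 30.92, "end": 31.36},
    {"word": "badge", "start": 31.36, "end": 31.82},
    {"word": "off", "start": 31.82, "end": 32.2},
    {"word": "of", "start": 32.2, "end": 32.44},
    {"word": "me", "start": 32.44, "end": 33.64},
    {"word": "I", "start": 33.64, "end": 37.38},
    {"word": "can't", "start": 37.38, "end": 37.82},
    {"word": "use", "start": 37.82, "end": 38.28},
    {"word": "it", "start": 38.28, "end": 38.94},
    {"word": "anymore", "start": 38.94, "end": 40.46},
]


def group_words(words: List[Dict], max_pause: float = 2.0) -> List[Dict]:
    """Group word entries into lines (hypothetical helper, heuristic only).

    A new line starts before any word whose duration, or whose gap from the
    previous word, is longer than max_pause seconds (an assumed threshold).
    """
    groups: List[List[Dict]] = []
    current: List[Dict] = []
    prev_end = None
    for w in words:
        gap = 0.0 if prev_end is None else w["start"] - prev_end
        duration = w["end"] - w["start"]
        # An unusually long word or gap is treated as a pause between lines.
        if current and (gap > max_pause or duration > max_pause):
            groups.append(current)
            current = []
        current.append(w)
        prev_end = w["end"]
    if current:
        groups.append(current)
    return [
        {
            "text": " ".join(w["word"] for w in g),
            "start": g[0]["start"],
            "end": g[-1]["end"],
        }
        for g in groups
    ]


if __name__ == "__main__":
    for line in group_words(WORDS):
        print(line)
```

On this sample the sketch splits the words into the same two lines as the segment output, but it is lossy: punctuation is gone, and the second line starts at 33.64 rather than the 37.0 reported in the segments, because the pause is counted as part of "I".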