Reputation: 990
I am new to both transformers.js and Whisper, and I am trying to make the return_timestamps parameter work.
I customized script.js from the transformers.js demo locally and added data.generation.return_timestamps = "char";
around line ~447, inside the GENERATE_BUTTON click handler, to pass the parameter. With that change in place, the timestamps appear as chunks (result
in worker.js):
{
  "text": " And so my fellow Americans ask not what your country can do for you ask what you can do for your country.",
  "chunks": [
    {
      "timestamp": [0, 8],
      "text": " And so my fellow Americans ask not what your country can do for you"
    },
    {
      "timestamp": [8, 11],
      "text": " ask what you can do for your country."
    }
  ]
}
However, the chunks are not "char"-level granular, as I expected from the return_timestamps
documentation.
I am looking for ideas on how to achieve char/word-level timestamp granularity with transformers.js and Whisper. Do some models/tools need to be updated and/or rebuilt?
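For reference, the local change described above looks roughly like this (a simplified paraphrase of my edit around line ~447 of the demo's script.js; GENERATE_BUTTON and data.generation come from that file, the rest of the handler is omitted):
// Simplified paraphrase of the edit in script.js; surrounding demo code omitted.
GENERATE_BUTTON.addEventListener('click', async () => {
    // data.generation holds the options that end up being passed to the model in worker.js.
    data.generation.return_timestamps = 'char';
    // ... rest of the original click handler (unchanged) ...
});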
Upvotes: 1
Views: 1725
Reputation: 618
Creator of transformers.js here. Yesterday, I added support for word-level timestamps (v2.4.0). You can use it as follows:
import { pipeline } from '@xenova/transformers';
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
let transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en', {
    revision: 'output_attentions',
});
let output = await transcriber(url, { return_timestamps: 'word' });
// {
//   "text": " And so my fellow Americans ask not what your country can do for you ask what you can do for your country.",
//   "chunks": [
//     { "text": " And", "timestamp": [0, 0.78] },
//     { "text": " so", "timestamp": [0.78, 1.06] },
//     { "text": " my", "timestamp": [1.06, 1.46] },
//     ...
//     { "text": " for", "timestamp": [9.72, 9.92] },
//     { "text": " your", "timestamp": [9.92, 10.22] },
//     { "text": " country.", "timestamp": [10.22, 13.5] }
//   ]
// }
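If you need subtitles, those word-level chunks are easy to post-process. Here is a small sketch (not part of the library itself, just assuming the { text, timestamp: [start, end] } shape shown above, with times in seconds) that turns the output into SRT cues:
// Sketch: convert word-level chunks into SRT cues.
function toSrtTime(seconds) {
    const ms = Math.round(seconds * 1000);
    const h = String(Math.floor(ms / 3_600_000)).padStart(2, '0');
    const m = String(Math.floor(ms / 60_000) % 60).padStart(2, '0');
    const s = String(Math.floor(ms / 1_000) % 60).padStart(2, '0');
    const millis = String(ms % 1000).padStart(3, '0');
    return `${h}:${m}:${s},${millis}`;
}

function chunksToSrt(chunks) {
    return chunks
        .map((chunk, i) =>
            `${i + 1}\n${toSrtTime(chunk.timestamp[0])} --> ${toSrtTime(chunk.timestamp[1])}\n${chunk.text.trim()}\n`)
        .join('\n');
}

console.log(chunksToSrt(output.chunks));
Each chunk becomes its own cue here; for readable subtitles you would typically merge several words into one cue.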
Upvotes: 4
Reputation: 3169
It took me a while to figure out the option for whisper.cpp
to generate word-level timestamps, so I am sharing an example here (using the command-line tool that comes with the library):
git clone git@github.com:ggerganov/whisper.cpp.git
cd whisper.cpp
bash ./models/download-ggml-model.sh base.en
make
./main -f samples/jfk.wav -oj --max-len 1
This is just a starting point. The command above produces an empty string at the start, as well as punctuation and spaces attached to the front of words, which may or may not be useful depending on your use case. Here's a snippet of the output:
"transcription": [
{
"timestamps": {
"from": "00:00:00,000",
"to": "00:00:00,320"
},
"offsets": {
"from": 0,
"to": 320
},
"text": ""
},
{
"timestamps": {
"from": "00:00:00,320",
"to": "00:00:00,370"
},
"offsets": {
"from": 320,
"to": 370
},
"text": " And"
},
{
"timestamps": {
"from": "00:00:00,370",
"to": "00:00:00,690"
},
"offsets": {
"from": 370,
"to": 690
},
"text": " so"
},
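If it helps, here is a small Node.js sketch for post-processing that JSON (assuming, on my side, that -oj wrote the result to samples/jfk.wav.json next to the input and that the structure matches the snippet above). It drops the empty leading entry and keeps each word with its millisecond offsets:
// Sketch: clean up whisper.cpp's JSON output (structure as in the snippet above).
// Assumes the -oj flag wrote the result to samples/jfk.wav.json.
import { readFileSync } from 'node:fs';

const result = JSON.parse(readFileSync('samples/jfk.wav.json', 'utf8'));

const words = result.transcription
    // Drop the empty entry emitted at the start.
    .filter((entry) => entry.text.trim().length > 0)
    // Keep just the word and its millisecond offsets.
    .map((entry) => ({
        word: entry.text.trim(),
        startMs: entry.offsets.from,
        endMs: entry.offsets.to,
    }));

console.log(words);
// e.g. [ { word: 'And', startMs: 320, endMs: 370 }, { word: 'so', startMs: 370, endMs: 690 }, ... ]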
Upvotes: -1
Reputation: 990
As of today (2023-04-06), OpenAI Whisper does not provide word- or char-level granularity for timestamps. There are some other projects based on Whisper that do.
Upvotes: -1