Yoz

Reputation: 990

transformers.js with whisper and return_timestamps

I am new to both transformers.js and Whisper, and I am trying to get the return_timestamps parameter to work.

I customized script.js from the transformers.js demo locally and added data.generation.return_timestamps = "char"; around line 447, inside the GENERATE_BUTTON click handler, to pass the parameter. With that change in place, the timestamps appear as chunks (the result in worker.js):

{
    "text": " And so my fellow Americans ask not what your country can do for you ask what you can do for your country.",
    "chunks": [
        {
            "timestamp": [0,8],
            "text": " And so my fellow Americans ask not what your country can do for you"
        },
        {
            "timestamp": [8,11],
            "text": " ask what you can do for your country."
        }
    ]
}

However, the chunks are not character-level granular, as I expected from the return_timestamps documentation.

I am looking for ideas on how to achieve char- or word-level timestamp granularity with transformers.js and Whisper. Do some models or tools need to be updated and/or rebuilt?

Upvotes: 1

Views: 1725

Answers (3)

Xenova

Reputation: 618

Creator of transformers.js here. Yesterday, I added support for word-level timestamps (v2.4.0). You can use it as follows:

import { pipeline } from '@xenova/transformers';

let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
let transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en', {
    revision: 'output_attentions',
});
let output = await transcriber(url, { return_timestamps: 'word' });
// {
//   "text": " And so my fellow Americans ask not what your country can do for you ask what you can do for your country.",
//   "chunks": [
//     { "text": " And", "timestamp": [0, 0.78] },
//     { "text": " so", "timestamp": [0.78, 1.06] },
//     { "text": " my", "timestamp": [1.06, 1.46] },
//     ...
//     { "text": " for", "timestamp": [9.72, 9.92] },
//     { "text": " your", "timestamp": [9.92, 10.22] },
//     { "text": " country.", "timestamp": [10.22, 13.5] }
//   ]
// }
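As an illustration of what the word-level chunks can be used for, here is a minimal sketch that groups them into SRT-style subtitle cues. The helpers toSrtTime and chunksToSrt are hypothetical names, not part of transformers.js, and the sketch assumes every chunk carries a numeric start and end timestamp in seconds, as in the output above.

```javascript
// Hypothetical helper (not part of transformers.js):
// format a time in seconds as an SRT timestamp, HH:MM:SS,mmm.
function toSrtTime(seconds) {
  const ms = Math.round(seconds * 1000);
  const pad = (n, width = 2) => String(n).padStart(width, '0');
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Hypothetical helper: build SRT text from word-level chunks,
// putting `wordsPerCue` consecutive words into each cue.
function chunksToSrt(chunks, wordsPerCue = 3) {
  const cues = [];
  for (let i = 0; i < chunks.length; i += wordsPerCue) {
    const group = chunks.slice(i, i + wordsPerCue);
    const start = group[0].timestamp[0];
    const end = group[group.length - 1].timestamp[1];
    // Each word already carries its leading space, so join without a separator.
    const text = group.map((c) => c.text).join('').trim();
    cues.push(`${cues.length + 1}\n${toSrtTime(start)} --> ${toSrtTime(end)}\n${text}`);
  }
  return cues.join('\n\n');
}

// First three words from the output above.
const sampleChunks = [
  { text: ' And', timestamp: [0, 0.78] },
  { text: ' so', timestamp: [0.78, 1.06] },
  { text: ' my', timestamp: [1.06, 1.46] },
];
console.log(chunksToSrt(sampleChunks));
```

In practice you would pass output.chunks from the transcriber call instead of sampleChunks; the cue size is an arbitrary choice for the sketch.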

Upvotes: 4

Ultrasaurus

Reputation: 3169

It took me a while to figure out the whisper.cpp option that generates word-level timestamps, so I am sharing an example here (using the command-line tool that comes with the library):

git clone git@github.com:ggerganov/whisper.cpp.git
cd whisper.cpp
bash ./models/download-ggml-model.sh base.en
make
./main -f samples/jfk.wav -oj --max-len 1

This is just a starting point. The output of the command above includes an empty string at the start, and punctuation and spaces before words, which may or may not be useful depending on your use case. Here's a snippet of the output:

    "transcription": [
        {
            "timestamps": {
                "from": "00:00:00,000",
                "to": "00:00:00,320"
            },
            "offsets": {
                "from": 0,
                "to": 320
            },
            "text": ""
        },
        {
            "timestamps": {
                "from": "00:00:00,320",
                "to": "00:00:00,370"
            },
            "offsets": {
                "from": 320,
                "to": 370
            },
            "text": " And"
        },
        {
            "timestamps": {
                "from": "00:00:00,370",
                "to": "00:00:00,690"
            },
            "offsets": {
                "from": 370,
                "to": 690
            },
            "text": " so"
        },
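If the empty entry and the leading spaces get in the way, the JSON can be post-processed. The following is a minimal sketch, assuming the "transcription" array has the shape shown in the snippet above, with "offsets" in milliseconds; toWords is a hypothetical helper name and not part of whisper.cpp.

```javascript
// Hypothetical helper (not part of whisper.cpp): drop empty entries,
// trim leading spaces, and convert millisecond offsets to seconds.
function toWords(transcription) {
  return transcription
    .filter((entry) => entry.text.trim() !== '')
    .map((entry) => ({
      text: entry.text.trim(),
      start: entry.offsets.from / 1000,
      end: entry.offsets.to / 1000,
    }));
}

// The three entries from the snippet above, reduced to the fields used here.
const sampleTranscription = [
  { offsets: { from: 0, to: 320 }, text: '' },
  { offsets: { from: 320, to: 370 }, text: ' And' },
  { offsets: { from: 370, to: 690 }, text: ' so' },
];
console.log(toWords(sampleTranscription));
```

Trimming discards the information about where word boundaries fall, so keep the original text if you need to reassemble the sentence exactly.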

Upvotes: -1

Yoz

Reputation: 990

As of today (2023-04-06), OpenAI Whisper does not provide word- or char-level granularity for timestamps. There are some other projects based on Whisper that do:

Upvotes: -1
