Timestamps reset every 30 seconds when using distil-whisper with return_timestamps=True

Question

Problem

I'm using distil-whisper through the 🤗 Transformers pipeline for speech recognition. With return_timestamps=True, the chunk timestamps reset to 0 every 30 seconds instead of continuing to increase across the entire audio file.

Here's my current code:

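from transformers import pipeline

# model, processor, torch_dtype, and device are created earlier (loading code not shown)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
    return_timestamps=True,
)

# Transcribe the file; per-chunk timestamps come back under result["chunks"]
result = pipe("audio.mp4")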

Output

The timestamps in the output look like this:

{'chunks': [
    {'text': 'First segment', 'timestamp': (0.0, 5.2)},
    {'text': 'Second segment', 'timestamp': (5.2, 12.8)},
    {'text': 'Later segment', 'timestamp': (28.4, 30.0)},
    {'text': 'Should be ~35s but shows', 'timestamp': (0.0, 4.6)},  # Resets here!
    ...
]}

Expected Behavior

I expect the timestamps to continue incrementing past 30 seconds, like this:

{'chunks': [
    {'text': 'First segment', 'timestamp': (0.0, 5.2)},
    {'text': 'Second segment', 'timestamp': (5.2, 12.8)},
    {'text': 'Later segment', 'timestamp': (28.4, 30.0)},
    {'text': 'Continues properly', 'timestamp': (30.0, 34.6)},  # Should continue
    ...
]}
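
To make the expected behavior concrete, here is a minimal sketch of the post-processing I would otherwise have to do by hand. It is only a workaround under stated assumptions, not a fix for the underlying issue: it assumes every reset marks the start of a new window exactly 30 seconds after the previous one, and that start times never decrease within a window. The names `make_monotonic` and `WINDOW_S` are hypothetical and not part of Transformers.

```python
WINDOW_S = 30.0  # assumed window length, taken from the 30-second resets observed above


def make_monotonic(chunks, window_s=WINDOW_S):
    """Return a copy of `chunks` with a running offset added after each timestamp reset."""
    fixed = []
    offset = 0.0
    prev_start = None
    for chunk in chunks:
        start, end = chunk["timestamp"]
        # A start time lower than the previous start is treated as a reset (assumption).
        if prev_start is not None and start is not None and start < prev_start:
            offset += window_s
        if start is not None:
            prev_start = start
        # Either value may be None, so only shift the values that are present.
        new_start = None if start is None else round(start + offset, 2)
        new_end = None if end is None else round(end + offset, 2)
        fixed.append({**chunk, "timestamp": (new_start, new_end)})
    return fixed


# The chunks from the Output section above
sample = [
    {"text": "First segment", "timestamp": (0.0, 5.2)},
    {"text": "Second segment", "timestamp": (5.2, 12.8)},
    {"text": "Later segment", "timestamp": (28.4, 30.0)},
    {"text": "Should be ~35s but shows", "timestamp": (0.0, 4.6)},
]

print(make_monotonic(sample)[-1]["timestamp"])  # (30.0, 34.6)
```

This would miss a reset if a whole window contains no speech, and it would drift if the windows are not exactly 30 seconds apart, so I'd prefer a way to get correct timestamps from the pipeline itself.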

Environment

Python 3.10
transformers 4.36.2
torch 2.1.2
Model: distil-whisper-large-v3

How can I fix this timestamp reset issue? Is there a way to make the timestamps continue incrementing throughout the entire audio file?
