Reputation: 3570
Using the FFmpeg library in my Android app, I am trying to understand how to seek to a very precise position in an audio file.
For example, I want to set the current position in my file to frame #1234567 (in a file encoded at 44100 Hz), which is equivalent to seeking to 27994.717 milliseconds.
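The conversion above is plain arithmetic. As a minimal sketch (the helper name sample_to_us is made up for illustration and is not part of FFmpeg), it can be done in 64-bit integers so that nothing is rounded before the value reaches a seek call:

```c
#include <stdint.h>

/* Hypothetical helper: convert a sample index to microseconds,
 * rounded to the nearest microsecond. Assumes a non-negative index. */
static int64_t sample_to_us(int64_t sample_index, int sample_rate)
{
    /* Multiply first, then divide, so the fractional part is not lost early. */
    return (sample_index * 1000000LL + sample_rate / 2) / sample_rate;
}
```

With the numbers from the question, sample_to_us(1234567, 44100) yields 27994717, i.e. 27994.717 ms.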
To achieve that, here is what I tried:
// this:
av_seek_frame(formatContext, -1, 27994717, 0);
// or this:
av_seek_frame(formatContext, -1, 27994717, AVSEEK_FLAG_ANY);
// or even this:
avformat_seek_file(formatContext, -1, 27994617, 27994717, 27994817, 0);
Using a position in microseconds gives me the best result so far.
But for some reason, the positioning is not totally accurate: when I extract the samples from the audio file, they don't start exactly at the expected position. There is a slight delay of about 30-40 milliseconds (surprisingly, even if I seek to position 0).
Am I using the function the right way, or even the right function?
EDIT
Here is how I can get the position:
AVPacket packet;
AVStream *stream = NULL;
AVFormatContext *formatContext = NULL;
AVCodec *dec = NULL;
// initialization:
avformat_open_input(&formatContext, filename, NULL, NULL);
avformat_find_stream_info(formatContext, NULL);
int audio_stream_index = av_find_best_stream(formatContext, AVMEDIA_TYPE_AUDIO, -1, -1, &dec, 0);
stream = formatContext->streams[audio_stream_index];
...
// later, when I extract samples, here is how I get my position, in microseconds:
av_read_frame(formatContext, &packet);
long position = (long) (1000000 * (packet.pts * ((float) stream->time_base.num / stream->time_base.den)));
Thanks to that piece of code, I can get the position of the beginning of the current frame (frame = block of samples; its size depends on the audio format: 1152 samples for mp3, 128 to 1152 for ogg, ...).
The problem is that the value I get in position is not accurate: it is actually about 30 ms late. For example, when it says 1000000, the actual position is approximately 1030000.
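As a side note on the position computation: going through float can cost a few microseconds of precision, which does not explain a 30 ms offset but adds noise when checking one. A minimal integer-only sketch follows; the Rational type and pts_to_us are stand-ins invented here, and in FFmpeg itself av_rescale_q(packet.pts, stream->time_base, AV_TIME_BASE_Q) performs this conversion with overflow protection.

```c
#include <stdint.h>

/* Stand-in for a stream time base: one tick lasts num/den seconds. */
typedef struct { int num; int den; } Rational;

/* Convert a non-negative pts expressed in `tb` ticks to microseconds.
 * Plain 64-bit math: fine for typical audio timestamps, but it can
 * overflow for extremely large pts values. */
static int64_t pts_to_us(int64_t pts, Rational tb)
{
    int64_t scaled = pts * tb.num * 1000000LL;
    return (scaled + tb.den / 2) / tb.den; /* round to nearest */
}
```

For instance, a pts of 14112000 in a 1/14112000 time base maps to exactly 1000000 microseconds.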
What did I do wrong? Is it a bug in FFmpeg?
Thanks for your help.
Upvotes: 10
Views: 3932
Reputation: 2151
By combining small pieces of information from the internet about this topic, I've managed to seek precisely using this technique (tested in an Android app only).
// --> Setup
AVFormatContext* formatContext;
AVCodecContext* codecContext;
// ...
int sample_rate = 44100;
int channel_count = 2;
// --> Do seek
bool seek_pending = true;
bool seek_issued = false; // set once the demuxer seek below has been performed
double seek_target_s = 4.5;
// ...
// --> Decode
bool is_eof = false;
AVPacket* audioPacket = av_packet_alloc();
AVFrame* audioFrame = av_frame_alloc();
while (!is_eof) {
    if (seek_pending && !seek_issued) {
        // Seek to 50 ms before the target; the exact sample is located below.
        int64_t seek_ts = seek_target_s * AV_TIME_BASE - 0.05 * AV_TIME_BASE;
        if (seek_ts < 0) seek_ts = 0;
        avformat_seek_file(formatContext, -1, 0, seek_ts, INT64_MAX, 0);
        avcodec_flush_buffers(codecContext);
        seek_issued = true; // do not seek again while skipping forward
    }
    if (av_read_frame(formatContext, audioPacket) < 0) {
        is_eof = true; // end of file or read error
        break;
    }
    int result = avcodec_send_packet(codecContext, audioPacket);
    av_packet_unref(audioPacket);
    while (result >= 0) {
        result = avcodec_receive_frame(codecContext, audioFrame);
        if (result < 0) continue; // ends the loop: the condition is now false
        // Express the frame PTS in interleaved samples.
        AVRational audio_time_base = (AVRational){1, sample_rate * channel_count};
        audioFrame->pts = av_rescale_q(audioFrame->pts, codecContext->pkt_timebase, audio_time_base);
        // delayedSamples is defined elsewhere (not shown in this snippet).
        int64_t audio_pts = audioFrame->pts - delayedSamples * channel_count;
        int skip_samples = 0;
        if (seek_pending) {
            int64_t seek_pts = seek_target_s * channel_count * sample_rate;
            int64_t next_pts = audio_pts + audioFrame->nb_samples * channel_count;
            if (next_pts < seek_pts) {
                // The frame ends before the target: drop it entirely.
                av_frame_unref(audioFrame);
                continue;
            }
            else {
                // The frame contains the target: skip its leading samples.
                skip_samples = seek_pts - audio_pts;
                if (skip_samples < 0) skip_samples = 0;
            }
        }
        seek_pending = false;
        seek_issued = false; // ready for the next seek request
        int samples_count = audioFrame->nb_samples * channel_count;
        process_frame(audioFrame->data, samples_count, skip_samples);
        av_frame_unref(audioFrame);
    }
}
It was tested on mp3 files. I omit error checking and resampling for simplicity.
Code details:
seek_target_s is in seconds, just to avoid dividing it by 1000.0.
audioFrame->pts is converted to samples, as are the other PTS-related values: audio_pts, seek_pts, next_pts.
audioFrame->nb_samples is the total number of audio samples in a frame divided by the number of channels, so I multiply some values by channel_count (it's just simpler for me to read the code).
The main idea is to seek to 50 ms before the target time and then search for the exact frame and sample that match the target time.
The if (next_pts < seek_pts) part skips frames that don't "contain" the target time.
When the correct frame is found, the else part computes how many samples away the target time is from the frame start.
Then process_frame is a custom method that does something with the audio data (write to a stream or a file, for example).
samples_count is the total number of audio samples to process (each sample is a float or signed 16-bit PCM value depending on the format).
skip_samples is how many samples to omit from the beginning of the data (only the last samples_count - skip_samples values are processed).
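The drop-or-skip decision described above can be isolated from FFmpeg and checked on its own. This is only a sketch under a few assumptions: samples_to_skip is a name invented here, positions are counted in per-channel samples rather than interleaved values, and a frame that ends exactly at the target is dropped (the loop above would keep it and skip all of its samples).

```c
#include <stdint.h>

/* Returns -1 when the whole frame lies before the target (drop it),
 * otherwise the number of leading samples (per channel) to skip. */
static int64_t samples_to_skip(int64_t frame_start, int nb_samples, int64_t target)
{
    int64_t frame_end = frame_start + nb_samples; /* first sample after the frame */
    if (frame_end <= target)
        return -1;                /* frame ends before the target */
    if (frame_start >= target)
        return 0;                 /* frame starts at or after the target */
    return target - frame_start;  /* target falls inside this frame */
}
```

For a 4.5 s target at 44100 Hz (sample 198450) and 1152-sample frames, the frame starting at sample 198144 is the first one kept, with 306 samples skipped.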
Some details on my investigation:
I needed an audio loop with exact time bounds in milliseconds.
When passing 0 as the seek flags, I heard a small chunk of silence at the beginning of the loop, maybe a couple of ms.
Though by listening to the loop I'd say the seek was exact, it just lost some initial audio data.
So I printed the frame data to the log just after the seek and saw it had 0x00 bytes in the first ~1.5 frames.
Then I tried to seek using av_seek_frame with the AVSEEK_FLAG_BACKWARD flag.
It was a bit better, as the silence at the beginning of the loop was shorter, but it was still there.
So my final "dirty" solution was to go a bit further than the AVSEEK_FLAG_BACKWARD flag does and rewind by an arbitrary, small, constant amount of time (50 ms was enough for me).
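The rewind itself is a clamped subtraction. A small sketch, assuming timestamps in microseconds (the scale of FFmpeg's AV_TIME_BASE) and using a function name, preroll_seek_us, that is invented for this example:

```c
#include <stdint.h>

/* Timestamp to hand to the demuxer seek: the target minus a small
 * pre-roll, clamped so it never goes below the start of the file. */
static int64_t preroll_seek_us(int64_t target_us, int64_t preroll_us)
{
    int64_t ts = target_us - preroll_us;
    return ts < 0 ? 0 : ts;
}
```

With a 4.5 s target and the 50 ms pre-roll from the answer, the demuxer is asked for 4450000; targets closer than 50 ms to the start are clamped to 0.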
Upvotes: 0
Reputation: 327
Late, but hopefully it helps someone. The idea is to save the target timestamp when seeking and then compare AVPacket->pts with this value (you can do that with AVStream->dts, but it wasn't giving good results in my experiments). If pts is still lower than the target timestamp, skip the remaining samples using the AV_PKT_DATA_SKIP_SAMPLES side data of the packet (AVPacket->side_data).
Code for seeking method:
void audio_decoder::seek(float seconds) {
    auto stream = m_format_ctx->streams[m_packet->stream_index];
    // Convert the seconds provided by the user to a timestamp in the
    // stream's time base, then save it for later.
    m_target_ts = av_rescale_q(static_cast<int64_t>(seconds * AV_TIME_BASE), AV_TIME_BASE_Q, stream->time_base);
    avcodec_flush_buffers(m_codec_ctx.get());
    // Here we seek within the given stream index, with a timestamp in
    // that stream's time base. Using AVSEEK_FLAG_BACKWARD to make sure
    // we're always *before* the requested timestamp.
    int err = av_seek_frame(m_format_ctx.get(), m_packet->stream_index, m_target_ts, AVSEEK_FLAG_BACKWARD);
    if(err < 0) { // av_seek_frame returns >= 0 on success
        error("audio_decoder: Error while seeking ({})", av_err_str(err));
    }
}
And code for decoding method:
void audio_decoder::decode() {
    <...>
    while(is_decoding) {
        // Read data as usual.
        av_read_frame(m_format_ctx.get(), m_packet.get());
        // Here is the juicy part. We were seeking, but the seek
        // wasn't precise enough, so we need to drop some samples.
        if(m_packet->pts > 0 && m_target_ts > 0 && m_packet->pts < m_target_ts) {
            auto stream = m_format_ctx->streams[m_packet->stream_index];
            // Convert the timestamp delta to a number of samples.
            int64_t skip_samples = av_rescale_q(m_target_ts - m_packet->pts, stream->time_base, av_make_q(1, m_codec_ctx->sample_rate));
            // Next step: we need to provide side data to our packet,
            // and it will tell the codec to drop samples.
            uint8_t *data = av_packet_get_side_data(m_packet.get(), AV_PKT_DATA_SKIP_SAMPLES, nullptr);
            if(!data) {
                data = av_packet_new_side_data(m_packet.get(), AV_PKT_DATA_SKIP_SAMPLES, 10);
                memset(data, 0, 10); // don't rely on the new buffer being zeroed (needs <cstring>)
            }
            // Define parameters of side data. You can check them here:
            // https://ffmpeg.org/doxygen/trunk/group__lavc__packet.html#ga9a80bfcacc586b483a973272800edb97
            // Bytes 0-3: u32le number of samples to skip from the packet start.
            AV_WL32(data, skip_samples); // from libavutil/intreadwrite.h
            data[8] = 0; // reason for the start skip
        }
        // Send packet as usual.
        avcodec_send_packet(m_codec_ctx.get(), m_packet.get());
        // Proceed to the receiving frames as usual, nothing to change there.
    }
    <...>
}
If it's unclear without context, you can check the same code in my project audio_decoder.cpp.
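For reference, the AV_PKT_DATA_SKIP_SAMPLES side data is a 10-byte buffer: a little-endian u32 with the samples to skip at the start, a little-endian u32 with the samples to discard at the end, then one byte each for the start and end skip reasons. The sketch below fills such a buffer in plain C; put_u32le and fill_skip_samples are names invented for this example, not FFmpeg functions.

```c
#include <stdint.h>

enum { SKIP_SAMPLES_SIZE = 10 };

/* Store a 32-bit value in little-endian byte order. */
static void put_u32le(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v);
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Fill a skip-samples buffer with the layout described above. */
static void fill_skip_samples(uint8_t buf[SKIP_SAMPLES_SIZE],
                              uint32_t skip_start, uint32_t skip_end)
{
    put_u32le(buf, skip_start);   /* bytes 0-3: skip from the packet start */
    put_u32le(buf + 4, skip_end); /* bytes 4-7: discard from the packet end */
    buf[8] = 0;                   /* reason for the start skip */
    buf[9] = 0;                   /* reason for the end skip */
}
```

Writing every byte explicitly avoids depending on the host byte order or on whatever a freshly allocated buffer happens to contain.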
Upvotes: 5
Reputation: 31101
It depends on the codec. For example, AAC has a resolution of 1024 samples per frame, no matter what the sample rate; it also has priming samples that may be discarded. MP3 has 1152 samples per frame for MPEG-1 Layer III and 576 for MPEG-2/2.5 Layer III.
If you need perfection, use an uncompressed format such as PCM in a WAV (RIFF) container.
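To make the frame granularity concrete: a 1152-sample MP3 frame at 44100 Hz lasts about 26 ms, so a demuxer-level seek can only land on multiples of that. A rough sketch of where a given sample falls, ignoring priming samples and encoder delay (FramePos and locate_sample are names made up for this example):

```c
#include <stdint.h>

/* Frame index and offset inside that frame for a sample position. */
typedef struct { int64_t frame; int offset; } FramePos;

static FramePos locate_sample(int64_t sample_index, int samples_per_frame)
{
    FramePos pos;
    pos.frame = sample_index / samples_per_frame;
    pos.offset = (int)(sample_index % samples_per_frame);
    return pos;
}
```

For the sample from the question, locate_sample(1234567, 1152) gives frame 1071 with an offset of 775 samples, which is the part a decoder has to skip after seeking to the frame boundary.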
Upvotes: 3