De-quantising audio with ffmpeg

Question

I am using the FFmpeg library to decode and (potentially) modify some audio.

I managed to use the following functions to iterate through all frames of the audio file:

avformat_open_input // Obtains formatContext
avformat_find_stream_info
av_find_best_stream // The argument AVMEDIA_TYPE_AUDIO is fed in to find the audio stream
avcodec_open2 // Obtains codecContext
av_init_packet

// The following is used to loop through the frames
av_read_frame
avcodec_decode_audio4

In the end, I have these three values available on each iteration:

int dataSize; // return value of avcodec_decode_audio4
AVFrame* frame;
AVCodecContext* codecContext; // Codec context of the best stream

I supposed that a loop like this can be used to iterate over all samples:

for (int i = 0; i < frame->nb_samples; ++i)
{
    // Bytes/Sample is known to be 4
    // Extracts audio from Channel 1. There are in total 2 channels.
    int* sample = (int*)frame->data[0] + dataSize * i;
    // Now *sample is accessible
}

However, when I plotted the data using gnuplot, I did not get a waveform as expected, and some of the values reached the limit of 32-bit integers. (The audio stream is not silent in the first few seconds.)

I suppose that some form of quantisation is going on to prevent the data from being interpreted mathematically. What should I do to de-quantise this?

Ronald S. Bultje · Accepted Answer

for (int i = 0; i < frame->nb_samples; ++i)
{
    // Bytes/Sample is known to be 4
    // Extracts audio from Channel 1. There are in total 2 channels.
    int* sample = (int*)frame->data[0] + dataSize * i;
    // Now *sample is accessible
}

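Well... No. In that loop, dataSize is the number of bytes avcodec_decode_audio4 consumed from the packet, not a per-sample stride, and adding to an int* already advances in units of sizeof(int), so the pointer quickly runs outside the frame's buffer. So, first of all, we'll need to know the data type. Check frame->format. It's an enum AVSampleFormat, most likely flt, fltp, s16 or s16p.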

So, how do you interpret frame->data[] given the format? Well, first, is it planar or not? If it's planar, it means each channel is in frame->data[n], where n is the channel number. frame->channels is the number of channels. If it's not planar, it means all data is interleaved (per sample) in frame->data[0].
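
To make the two layouts concrete, here is a minimal sketch in plain C with no FFmpeg dependency. The helper name deinterleave_s16 is hypothetical, not an FFmpeg function. It assumes 16-bit samples and shows how an interleaved buffer (as you would find in frame->data[0]) maps onto one plane per channel (as you would find in frame->data[n] for a planar format):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of FFmpeg): split interleaved 16-bit
 * samples laid out as L0 R0 L1 R1 ... into one plane per channel.
 * planes[c] must have room for nb_samples values. */
static void deinterleave_s16(const int16_t *interleaved, int16_t **planes,
                             size_t nb_samples, size_t channels)
{
    assert(interleaved && planes && channels > 0);
    for (size_t i = 0; i < nb_samples; i++) {
        for (size_t c = 0; c < channels; c++) {
            /* Interleaved: sample i of channel c sits at i * channels + c.
             * Planar: the same sample sits at index i of plane c. */
            planes[c][i] = interleaved[i * channels + c];
        }
    }
}
```

With a stereo input of {1, -1, 2, -2, 3, -3}, the first plane receives 1, 2, 3 and the second receives -1, -2, -3.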

Second, what is the storage type? If it's s16/s16p, it's int16_t *. If it's flt/fltp, it's float *. So the correct interpretation for fltp would be:

for (int c = 0; c < frame->channels; c++) {
    // frame->data[] is uint8_t *, so cast to the real sample type
    float *samples = (float *)frame->data[c];
    for (int i = 0; i < frame->nb_samples; i++) {
        float sample = samples[i];
        // now this sample is accessible, it's in the range [-1.0, 1.0]
    }
}

Whereas for s16, it would be:

// frame->data[] is uint8_t *, so cast to the real sample type
int16_t *samples = (int16_t *)frame->data[0];
for (int c = 0; c < frame->channels; c++) {
    for (int i = 0; i < frame->nb_samples; i++) {
        int sample = samples[i * frame->channels + c];
        // now this sample is accessible, it's in the range [-32768,32767]
    }
}
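
If you want the s16 samples on the same [-1.0, 1.0] scale as the float formats (for plotting, say), dividing by 32768 is one common convention. The following is a sketch under that assumption, in plain C without FFmpeg; s16_to_float and s16_channel_to_float are hypothetical helper names, not FFmpeg functions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: map one signed 16-bit sample to a float in
 * [-1.0, 1.0). Assumes the divide-by-32768 convention. */
static float s16_to_float(int16_t s)
{
    return (float)s / 32768.0f;
}

/* Hypothetical helper: copy channel c out of an interleaved s16 buffer
 * into out[], converting each sample to float along the way. */
static void s16_channel_to_float(const int16_t *interleaved, float *out,
                                 size_t nb_samples, size_t channels,
                                 size_t c)
{
    assert(interleaved && out && c < channels);
    for (size_t i = 0; i < nb_samples; i++)
        out[i] = s16_to_float(interleaved[i * channels + c]);
}
```

In a decode loop you would pass (const int16_t *)frame->data[0], frame->nb_samples and frame->channels, after checking that frame->format really is s16.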
