Lutando

Reputation: 5010

Get waveform data from audio file using FFmpeg

I am writing a C#/.NET application that needs to get the raw waveform data of an audio file so I can render it. I am using FFmpeg to offload this task, but it looks like FFmpeg can only output the waveform data as a PNG or as a stream to gnuplot.

I have looked at other libraries to do this (NAudio/CSCore), however they require Windows/Microsoft Media Foundation, and since this app is going to be deployed to Azure as a web app I cannot use them.

My fallback strategy was to just read the waveform data from the PNG itself, but this seems hacky and over the top. The ideal output would be a fixed-size array of sampled peaks, where each value in the array is the peak value (ranging from 1-100 or something).

Upvotes: 5

Views: 8507

Answers (2)

Mathieu Rodic

Reputation: 6762

You can use the function described in this tutorial to get the raw data decoded from an audio file as an array of double values.

Summarizing from the link:

The function decode_audio_file takes 4 parameters:

  • path: the path of the file to decode
  • sample_rate: the desired sample rate for the output data
  • data: a pointer to a pointer to double precision values, where the extracted data will be stored
  • size: a pointer to the length of the final extracted values array (number of samples)

It returns 0 on success and -1 on failure, with an error message written to the stderr stream.

The function code is below:

#include <stdio.h>
#include <stdlib.h>
#include <string.h> // for memcpy
 
#include <libavutil/opt.h>
#include <libavcodec/avcodec.h>
#include <libavformat/avformat.h>
#include <libswresample/swresample.h>
 
 
int decode_audio_file(const char* path, const int sample_rate, double** data, int* size) {
 
    // initialize all muxers, demuxers and protocols for libavformat
    // (does nothing if called twice during the course of one program execution)
    av_register_all();
 
    // get format from audio file
    AVFormatContext* format = avformat_alloc_context();
    if (avformat_open_input(&format, path, NULL, NULL) != 0) {
        fprintf(stderr, "Could not open file '%s'\n", path);
        return -1;
    }
    if (avformat_find_stream_info(format, NULL) < 0) {
        fprintf(stderr, "Could not retrieve stream info from file '%s'\n", path);
        return -1;
    }
 
    // Find the index of the first audio stream
    int stream_index = -1;
    for (int i=0; i<format->nb_streams; i++) {
        if (format->streams[i]->codec->codec_type == AVMEDIA_TYPE_AUDIO) {
            stream_index = i;
            break;
        }
    }
    if (stream_index == -1) {
        fprintf(stderr, "Could not retrieve audio stream from file '%s'\n", path);
        return -1;
    }
    AVStream* stream = format->streams[stream_index];
 
    // find & open codec
    AVCodecContext* codec = stream->codec;
    if (avcodec_open2(codec, avcodec_find_decoder(codec->codec_id), NULL) < 0) {
        fprintf(stderr, "Failed to open decoder for stream #%u in file '%s'\n", stream_index, path);
        return -1;
    }
 
    // prepare resampler
    struct SwrContext* swr = swr_alloc();
    av_opt_set_int(swr, "in_channel_count",  codec->channels, 0);
    av_opt_set_int(swr, "out_channel_count", 1, 0);
    av_opt_set_int(swr, "in_channel_layout",  codec->channel_layout, 0);
    av_opt_set_int(swr, "out_channel_layout", AV_CH_LAYOUT_MONO, 0);
    av_opt_set_int(swr, "in_sample_rate", codec->sample_rate, 0);
    av_opt_set_int(swr, "out_sample_rate", sample_rate, 0);
    av_opt_set_sample_fmt(swr, "in_sample_fmt",  codec->sample_fmt, 0);
    av_opt_set_sample_fmt(swr, "out_sample_fmt", AV_SAMPLE_FMT_DBL,  0);
    swr_init(swr);
    if (!swr_is_initialized(swr)) {
        fprintf(stderr, "Resampler has not been properly initialized\n");
        return -1;
    }
 
    // prepare to read data
    AVPacket packet;
    av_init_packet(&packet);
    AVFrame* frame = av_frame_alloc();
    if (!frame) {
        fprintf(stderr, "Error allocating the frame\n");
        return -1;
    }
 
    // iterate through frames
    *data = NULL;
    *size = 0;
    while (av_read_frame(format, &packet) >= 0) {
        // decode one frame
        int gotFrame;
        int ret = avcodec_decode_audio4(codec, frame, &gotFrame, &packet);
        av_packet_unref(&packet); // release the packet allocated by av_read_frame
        if (ret < 0) {
            break;
        }
        if (!gotFrame) {
            continue;
        }
        // resample frames; size the output buffer for the worst case, since
        // the requested output rate may be higher than the input rate
        double* buffer;
        int out_samples = (int) av_rescale_rnd(
            swr_get_delay(swr, codec->sample_rate) + frame->nb_samples,
            sample_rate, codec->sample_rate, AV_ROUND_UP);
        av_samples_alloc((uint8_t**) &buffer, NULL, 1, out_samples, AV_SAMPLE_FMT_DBL, 0);
        int frame_count = swr_convert(swr, (uint8_t**) &buffer, out_samples, (const uint8_t**) frame->data, frame->nb_samples);
        // append resampled frames to data
        *data = (double*) realloc(*data, (*size + frame_count) * sizeof(double));
        memcpy(*data + *size, buffer, frame_count * sizeof(double));
        *size += frame_count;
        av_freep(&buffer); // free the temporary resample buffer
    }
 
    // clean up
    av_frame_free(&frame);
    swr_free(&swr);
    avcodec_close(codec);
    avformat_close_input(&format); // also frees the context
 
    // success
    return 0;
 
}

You will need the following linker flags to compile a program that uses it: -lavcodec-ffmpeg -lavformat-ffmpeg -lavutil -lswresample. Depending on your system and installation, it could instead be: -lavcodec -lavformat -lavutil -lswresample (for example: gcc your_program.c -o your_program -lavcodec -lavformat -lavutil -lswresample).

Its usage is shown below:

int main(int argc, char const *argv[]) {
 
    // check parameters
    if (argc < 2) {
        fprintf(stderr, "Please provide the path to an audio file as first command-line argument.\n");
        return -1;
    }
 
    // decode data
    int sample_rate = 44100;
    double* data;
    int size;
    if (decode_audio_file(argv[1], sample_rate, &data, &size) != 0) {
        return -1;
    }
 
    // sum data
    double sum = 0.0;
    for (int i=0; i<size; ++i) {
        sum += data[i];
    }
 
    // display result and exit cleanly
    printf("sum is %f", sum);
    free(data);
    return 0;
}

Upvotes: 1

VC.One

Reputation: 15871

Hello buddy,

I wrote about the manual way to get the waveform (part 2 below), but then, to show you an example, I found this code which does what you want (or at the least, you can learn something from it).

1) Use FFmpeg to get array of samples

Try the example code shown here : http://blog.wudilabs.org/entry/c3d357ed/?lang=en-US

Experiment with it, and try tweaking it with suggestions from the FFmpeg manual, etc. In the code shown there, just change string path to point to your own file path. Edit the proc.StartInfo.Arguments section so that its last part looks like:

proc.StartInfo.Arguments = "-i \"" + path + "\" -vn -ac 1 -filter:a aresample=myNum -map 0:a -c:a pcm_s16le -f data -";
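In case that linked post ever disappears: below is a minimal sketch of the Process setup the snippet revolves around (my own illustrative reconstruction, not the blog's exact code; the class name, buffer size and chunk handling are assumptions):

using System;
using System.Diagnostics;

class WaveformReader
{
    static void Main(string[] args)
    {
        string path = args[0]; // path to your audio file
        int myNum = 4410;      // resample rate; how to choose it is explained below

        var proc = new Process();
        proc.StartInfo.FileName = "ffmpeg"; // assumes ffmpeg is on the PATH
        // mono (-ac 1), resampled, raw signed 16-bit little-endian PCM to stdout
        proc.StartInfo.Arguments = "-i \"" + path + "\" -vn -ac 1 -filter:a aresample=" + myNum +
                                   " -map 0:a -c:a pcm_s16le -f data -";
        proc.StartInfo.UseShellExecute = false;
        proc.StartInfo.RedirectStandardOutput = true;
        proc.Start();

        // read the raw PCM stream in chunks and hand each chunk to ProcessBuffer
        var pcm = proc.StandardOutput.BaseStream;
        byte[] buffer = new byte[8192];
        int read;
        while ((read = pcm.Read(buffer, 0, buffer.Length)) > 0)
        {
            // drop a possible trailing half-sample; real code should carry
            // the leftover byte over into the next chunk instead
            ProcessBuffer(buffer, read - (read % 2));
        }
        proc.WaitForExit();
    }

    static void ProcessBuffer(byte[] buffer, int length) { /* shown further below */ }
}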

That myNum in the aresample=myNum part is calculated as:

44100 * total seconds = X
myNum = X / waveform width
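For example (illustrative numbers, not from the post): a 60-second file gives X = 44100 * 60 = 2,646,000 samples, so for a 600-pixel-wide waveform, myNum = 2,646,000 / 600 = 4410.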

Finally, use the ProcessBuffer function with this logic:

static void ProcessBuffer(byte[] buffer, int length)
{
    float val; //amplitude value of a sample
    int index = 0; //position within sample bytes
    int slicePos = 0; //horizontal (X-axis) position for pixels of next slice

    while (index + sizeof(short) <= length) //stop before a truncated sample
    {
        //each 16-bit sample is two consecutive little-endian bytes
        val = BitConverter.ToInt16(buffer, index);
        index += sizeof(short);

        // use number in val to do something...
        // eg: Draw a line on canvas for part of waveform's pixels
        // eg: myBitmap.SetPixel(slicePos, val, Color.Green);

        slicePos++;
    }
}

If you want to do it manually without FFmpeg, you could try...

2) Decode audio to PCM
You could just load the audio file (eg: MP3) into your app and first decode it to PCM (ie: raw digital audio). Then read just the PCM numbers to make the waveform. Don't read numbers directly from the bytes of a compressed format like MP3.

These PCM data values (audio amplitudes) go into a byte array. If your sound is 16-bit, then you extract each PCM value by reading one sample as a short (ie: getting the value of two consecutive bytes at once, since 16 bits == 2 bytes).

Basically, when you have 16-bit audio PCM inside a byte array, every two bytes represents an audio sample's amplitude value. This value becomes your height (loudness) at each slice. A slice is a 1-pixel-wide vertical line at one point in time of the waveform.

Now sample rate means how many samples per second, usually 44100 samples (44.1 kHz). You can see that using 44 thousand pixels to represent one second of sound is not feasible, so divide the total samples by the required waveform width. Take that result and multiply it by 2 (to cover the two bytes of each sample), and that is how far you jump-&-sample the amplitudes as you form the waveform. Do this in a while loop, as sketched below.
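To make that jump-&-sample idea concrete, here is a minimal sketch (the names ExtractPeaks, pcm and width are mine, purely illustrative) that turns mono 16-bit little-endian PCM bytes into one peak value per waveform slice, scaled to the 1-100 range the question asked for:

// Minimal sketch: one peak per waveform slice from mono 16-bit PCM.
static int[] ExtractPeaks(byte[] pcm, int width)
{
    int totalSamples = pcm.Length / 2;                // 2 bytes per 16-bit sample
    int samplesPerSlice = Math.Max(1, totalSamples / width);
    int[] peaks = new int[width];

    for (int slice = 0; slice < width; slice++)
    {
        int start = slice * samplesPerSlice;
        int max = 0;
        for (int s = start; s < start + samplesPerSlice && s < totalSamples; s++)
        {
            // each sample is two consecutive bytes, so the byte jump is s * 2
            int amp = Math.Abs((int) BitConverter.ToInt16(pcm, s * 2));
            if (amp > max) max = amp;
        }
        peaks[slice] = max * 100 / 32768;             // scale to 1-100
    }
    return peaks;
}

Taking the largest magnitude in each slice (rather than just the first sample) keeps short transients visible in the rendered waveform.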

Upvotes: 5
