ruhig brauner

Reputation: 963

Realtime audio application, improving performance

I am currently writing a C++ real-time audio application which roughly contains:

I think my PC should be able to handle this, but I get buffer underflows every so often, so I would like to improve the performance of my application. I have a bunch of questions that I hope you can answer. :)

1) Operator Overloading

Instead of working directly with my float samples and doing the calculations per sample, I pack my floats into a Frame class that contains the left and the right sample. The class overloads some operators for addition, subtraction, and multiplication by a float.

The filters (mostly biquads) and the reverb work with floats and don't use this class, but the Hermite interpolator and every multiplication and addition for volume control and mixing use it.

Does this have an impact on performance, and would it be better to work with the left and right samples directly?
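For reference, a minimal sketch of what such a class could look like. The real Frame class is not shown in this post, so everything below apart from the member names leftSample and rightSample is an assumption made for illustration. With the operators defined inline in the header, the compiler can see their bodies at every call site.

```cpp
// Hypothetical sketch of a stereo frame; not the actual class from the application.
struct Frame {
    float leftSample;
    float rightSample;

    Frame() : leftSample(0.0f), rightSample(0.0f) {}
    Frame(float left, float right) : leftSample(left), rightSample(right) {}
};

// Operators are small and defined inline so they can be expanded at the call site.
inline Frame operator+(Frame a, Frame b) {
    return Frame(a.leftSample + b.leftSample, a.rightSample + b.rightSample);
}

inline Frame operator-(Frame a, Frame b) {
    return Frame(a.leftSample - b.leftSample, a.rightSample - b.rightSample);
}

inline Frame operator*(float gain, Frame f) {
    return Frame(gain * f.leftSample, gain * f.rightSample);
}

inline Frame operator*(Frame f, float gain) {
    return gain * f;
}
```

An expression such as 0.5f * (a + b) then performs the same two additions and two multiplications as the hand-written per-channel version.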

2) std::function

The callback function from the audio I/O library PortAudio calls a std::function. I use this to encapsulate everything related to PortAudio, so the "user" sets their own callback function with std::bind:

std::bind(  &AudioController::processAudio, 
            &(*this), 
            std::placeholders::_1, 
            std::placeholders::_2));

Since the right function has to be looked up by the CPU on every callback (however this works...), does this have an impact, and would it be better to define a class the user has to inherit from?
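To make the two options concrete, here is a small sketch that puts them side by side. AudioCallback, AudioCallbackInterface and DemoController are hypothetical names invented for this illustration, and the PortAudio wrapper itself is left out. In both variants the callee is reached through one indirect call per buffer, not per sample.

```cpp
#include <functional>

// Option 1 (hypothetical alias): the wrapper stores a type-erased callable.
using AudioCallback = std::function<void(int frameCount, float* output)>;

// Option 2 (hypothetical interface): the user derives from a base class.
struct AudioCallbackInterface {
    virtual ~AudioCallbackInterface() = default;
    virtual void processAudio(int frameCount, float* output) = 0;
};

// Stand-in for the real controller; it just fills an interleaved stereo buffer.
class DemoController : public AudioCallbackInterface {
public:
    void processAudio(int frameCount, float* output) override {
        for (int i = 0; i < frameCount * 2; ++i) {
            output[i] = 0.25f;
        }
    }

    // A lambda capturing `this` is an alternative to std::bind with placeholders.
    AudioCallback makeCallback() {
        return [this](int frameCount, float* output) {
            processAudio(frameCount, output);
        };
    }
};
```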

3) virtual functions

I use a class called AudioProcessor which declares a virtual function:

virtual void tick(Frame *buffer, int frameCount) = 0;

This function always processes a number of frames at once: depending on the driver, between 200 and 1000 frames per call. Within the signal processing path, I call this function 6 times across multiple derived classes. I remember that this is done with lookup tables so the CPU knows exactly which function it has to call. Does calling a virtual function (one overridden in a derived class) have an impact on performance?

The nice thing about this is the structure it gives the source code, but using only inline functions might improve performance.
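A stripped-down sketch of this structure, to show where the virtual dispatch happens. GainProcessor is a hypothetical processor made up for this example, and Frame is reduced to a plain struct. The virtual call is made once per block, while the per-frame work inside the loop is an ordinary member function the compiler is free to inline.

```cpp
// Simplified stand-in for the real Frame class.
struct Frame {
    float leftSample;
    float rightSample;
};

class AudioProcessor {
public:
    virtual ~AudioProcessor() = default;
    virtual void tick(Frame* buffer, int frameCount) = 0;
};

// Hypothetical processor; `final` tells the compiler no further overrides exist.
class GainProcessor final : public AudioProcessor {
public:
    explicit GainProcessor(float gain) : gain_(gain) {}

    // One virtual dispatch per block of frames.
    void tick(Frame* buffer, int frameCount) override {
        for (int i = 0; i < frameCount; ++i) {
            applyGain(buffer[i]);  // non-virtual, can be inlined
        }
    }

private:
    void applyGain(Frame& f) const {
        f.leftSample *= gain_;
        f.rightSample *= gain_;
    }

    float gain_;
};
```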

These are all questions for now. I have some more about Qt's event loop because I think that my GUI uses quite a bit of CPU time as well. But this is another topic I guess. :)

Thanks in advance!


These are all the relevant function calls within the signal processing path. Some of them are from the STK library. The biquad functions are from STK and should perform fine. The same goes for the Freeverb algorithm.

// ################################ AudioController Function ############################
void AudioController::processAudio(int frameCount, float *output) {
    // CALCULATE LEFT TRACK

    Frame * leftFrameBuffer = (Frame*) output;

    if(leftLoaded) { // the left processor is loaded
        leftProcessor->tick(leftFrameBuffer, frameCount);   //(TrackProcessor::tick()
    } else {
        for(int i = 0; i < frameCount; i++) {
            leftFrameBuffer[i].leftSample  = 0.0f;
            leftFrameBuffer[i].rightSample = 0.0f;
        }
    }

    // CALCULATE RIGHT TRACK

    if(rightLoaded) { // the right processor is loaded
        // the rightFrameBuffer is allocated once and ensured to have enough space for frameCount Frames
        rightProcessor->tick(rightFrameBuffer, frameCount); //(TrackProcessor::tick()
    } else {
        for(int i = 0; i < frameCount; i++) {
            rightFrameBuffer[i].leftSample  = 0.0f;
            rightFrameBuffer[i].rightSample = 0.0f;
        }
    }

    // MIX
    for(int i = 0; i < frameCount; i++ ) {
        leftFrameBuffer[i] = volume * (leftRightMix * leftFrameBuffer[i] + (1.0f - leftRightMix) * rightFrameBuffer[i]);
    }
}

// ################################ TrackProcessor Function ############################

void TrackProcessor::tick(Frame *frames, int frameNum) {
    if(bufferLoaded && playback) {
        for(int i = 0; i < frameNum; i++) {
            // read from buffer
            frames[i] =  bufferPlayer->tick();

            // filter coeffs
            caltulateFilterCoeffs(lowCutoffFilter->tick(), highCutoffFilter->tick());

            // filter
            frames[i].leftSample = lpFilterL->tick(hpFilterL->tick(frames[i].leftSample));
            frames[i].rightSample = lpFilterR->tick(hpFilterR->tick(frames[i].rightSample));
        }
    } else {
        for(int i = 0; i < frameNum; i++) {         
            frames[i] = Frame(0,0);
        }
    }

    // Effect 1, Equalizer
    if(effsActive[0]) {
        insEffProcessors[0]->tick(frames, frameNum);
    }
    // Effect 2, Reverb
    if(effsActive[1]) {
        insEffProcessors[1]->tick(frames, frameNum);
    }

    // Volume
    for(int i = 0; i < frameNum; i++) {
        frames[i].leftSample  *= volume;
        frames[i].rightSample *= volume;
    }
}

// ################################ Equalizer ############################

void EqualizerProcessor::tick(Frame *frames, int frameNum) {
    if(active) {
        Frame lowCross;
        Frame highCross;

        for(int f = 0; f < frameNum; f++) {

            lowAmp = lowAmpFilter->tick();
            midAmp = midAmpFilter->tick();
            highAmp = highAmpFilter->tick();

            lowCross =  highLPF->tick(frames[f]);
            highCross = highHPF->tick(frames[f]);

            frames[f] = lowAmp * lowLPF->tick(lowCross) 
                      + midAmp * lowHPF->tick(lowCross) 
                      + highAmp * lowAPF->tick(highCross);
        }
    }
}

// ################################ Reverb ############################
// This function just calls the stk::FreeVerb tick function for every frame
// The FreeVerb implementation can't really be optimized, so I will take it as it is.

void ReverbProcessor::tick(Frame *frames, int frameNum) {
    if(active) {
        for(int i = 0; i < frameNum; i++) {
            frames[i].leftSample = reverb->tick(frames[i].leftSample, frames[i].rightSample);
            frames[i].rightSample = reverb->lastOut(1);
        }
    }
}

// ################################ Buffer Playback (BufferPlayer) ############################

Frame BufferPlayer::tick() {
    // adjust read position based on loop status
    if(inLoop) {
        while(readPos > loopEndPos) {
            readPos = loopStartPos + (readPos - loopEndPos); 
        }
    }

    int x1  = readPos;
    float t = readPos - x1;

    Frame f = interpolate(buffer->frameAt(x1-1), 
                          buffer->frameAt(x1),
                          buffer->frameAt(x1+1),
                          buffer->frameAt(x1+2),
                          t);

    readPos += stepSize;
    return f;
}

// interpolation:
Frame BufferPlayer::interpolate(Frame x0, Frame x1, Frame x2, Frame x3, float t) {
    Frame c0 = x1;
    Frame c1 = 0.5f * (x2 - x0);
    Frame c2 = x0 - (2.5f * x1) + (2.0f * x2) - (0.5f * x3);
    Frame c3 = (0.5f * (x3 - x0)) + (1.5f * (x1 - x2));
    return (((((c3 * t) + c2) * t) + c1) * t) + c0;
}


inline Frame BufferPlayer::frameAt(int pos) {
    if(pos < 0) {
        pos = 0;
    } else if (pos >= frames) {
        pos = frames -1;
    }

    // get chunk and relative Sample
    int chunk = pos/ChunkSize;
    int chunkSample = pos%ChunkSize;

    return Frame(leftChunks[chunk][chunkSample], rightChunks[chunk][chunkSample]); 
}

Upvotes: 1

Views: 1477

Answers (1)

Thomas Matthews

Reputation: 57728

Some suggestions on performance improvement:

Optimize Data Cache Usage

Review your functions that operate on a lot of data (e.g. arrays). The functions should load data into cache, operate on the data, then store back into memory.

The data should be organized to best fit into the data cache. Break up the data into smaller blocks if it doesn't fit. Search the web for "data-oriented design" and "cache optimizations".

In one project, performing data smoothing, I changed the layout of data and gained 70% performance.
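As a rough sketch of the layout idea: one common arrangement for audio is to split interleaved stereo data into one contiguous array per channel, so that each processing loop walks straight through memory. The function names deinterleave and applyGain are made up for this illustration, and the buffers are assumed to be allocated once, outside the audio callback.

```cpp
#include <cstddef>

// Split interleaved stereo (L R L R ...) into two contiguous channel arrays.
// `left` and `right` must each have room for `frameCount` samples.
void deinterleave(const float* interleaved, float* left, float* right,
                  std::size_t frameCount) {
    for (std::size_t i = 0; i < frameCount; ++i) {
        left[i]  = interleaved[2 * i];
        right[i] = interleaved[2 * i + 1];
    }
}

// Operates on one contiguous channel; the access pattern is purely sequential.
void applyGain(float* channel, std::size_t count, float gain) {
    for (std::size_t i = 0; i < count; ++i) {
        channel[i] *= gain;
    }
}
```

Whether this layout actually helps depends on the processing chain, so it is worth measuring before and after.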

Use Multiple Threads

In the big picture, you may be able to use at least three dedicated threads: input, processing and output. The input thread obtains the data and stores it in buffer(s); search the Web for "double buffering". The second thread gets data from the input buffer, processes it, then writes to an output buffer. The third thread writes data from the output buffer to the file.

You may also benefit from using threads for left and right samples. For example, while one thread is processing the left sample, another thread could be processing the right sample. If you could put the threads on different cores, you may see even more performance benefit.
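Handing data from one thread to the next needs some kind of queue between them. Below is a minimal sketch of a single-producer/single-consumer ring buffer built on std::atomic. The class name SpscRingBuffer is made up for this example, and the code assumes exactly one thread calls push and exactly one thread calls pop.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical lock-free queue for passing samples between two threads.
class SpscRingBuffer {
public:
    // One slot is kept empty to distinguish "full" from "empty".
    explicit SpscRingBuffer(std::size_t capacity)
        : data_(capacity + 1), head_(0), tail_(0) {}

    // Producer thread only. Returns false if the buffer is full.
    bool push(float value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % data_.size();
        if (next == tail_.load(std::memory_order_acquire)) {
            return false;
        }
        data_[head] = value;
        head_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer thread only. Returns false if the buffer is empty.
    bool pop(float& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) {
            return false;
        }
        out = data_[tail];
        tail_.store((tail + 1) % data_.size(), std::memory_order_release);
        return true;
    }

private:
    std::vector<float> data_;
    std::atomic<std::size_t> head_;  // next slot to write
    std::atomic<std::size_t> tail_;  // next slot to read
};
```

Neither push nor pop blocks or allocates, which is why this kind of queue is a common choice on the audio thread side.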

Use GPU Processing

Many modern Graphics Processing Units (GPUs) have a large number of cores that can process floating point values. Maybe you could delegate some of the filtering or analysis functions to the cores in the GPU. Be aware that this incurs overhead; to see a benefit, the processing itself must cost more than the overhead of moving the data to and from the GPU.

Reduce Branching

Processors prefer to manipulate data over branching. Branching can stall execution, as the processor has to figure out where to fetch the next instruction from. Some processors have large instruction caches that can hold small loops, but there is still a penalty for branching back to the top of the loop. See "Loop Unrolling". Also check your compiler settings and enable a high optimization level. Many compilers will unroll loops for you when the conditions are right.
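One simple way to reduce branching is to move a condition that does not change during the loop out of the loop body. The sketch below shows the same operation written both ways; the function names are hypothetical. An optimizing compiler may perform this transformation on its own, so check the generated code or measure before rewriting.

```cpp
#include <cstddef>

// The branch is evaluated once per sample.
void scaleIfActivePerSample(float* samples, std::size_t count,
                            bool active, float gain) {
    for (std::size_t i = 0; i < count; ++i) {
        if (active) {
            samples[i] *= gain;
        }
    }
}

// Same result, but the loop-invariant branch is evaluated once per block.
void scaleIfActiveHoisted(float* samples, std::size_t count,
                          bool active, float gain) {
    if (!active) {
        return;
    }
    for (std::size_t i = 0; i < count; ++i) {
        samples[i] *= gain;
    }
}
```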

Reduce the Amount of Processing

Do you need to process the entire sample or only portions of it? For example, in video processing, much of the frame doesn't change; only small portions do. So the entire frame doesn't need to be processed. Can the audio channels be isolated so only a few channels are processed rather than the entire spectrum?

Coding to Help the Compiler Optimize

You can help the compiler with optimizations by using the const modifier. The compiler may be able to use different algorithms for variables that don't change versus ones that do. For example, a const value can be placed in the executable code, but a non-const value must be placed in memory.

Using static and const can help too. The static usually implies only one instance. The const implies something that doesn't change. So if there is only one instance of the variable that doesn't change, the compiler can place it into the executable or read-only memory and perform a higher optimization of the code.

Loading multiple variables at the same time can help too. The processor can place the data into the cache. The compiler may be able to use specialized assembly instructions for fetching sequential data.
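A small sketch of these hints in code, using a made-up 3-tap smoothing filter (kTaps and smooth are hypothetical names, not taken from the question). The coefficients are static constexpr, so they are fixed at compile time, and the input is passed as a pointer to const so the function cannot modify it.

```cpp
#include <cstddef>

// Compile-time constant coefficients; the compiler can fold them into the code.
static constexpr float kTaps[3] = {0.25f, 0.5f, 0.25f};

// `in` is read-only; `out` must have room for `count` samples and must not
// overlap `in`. The first and last samples are copied through unchanged.
void smooth(const float* in, float* out, std::size_t count) {
    if (count == 0) {
        return;
    }
    out[0] = in[0];
    out[count - 1] = in[count - 1];
    // Sequential reads of in[i - 1], in[i], in[i + 1].
    for (std::size_t i = 1; i + 1 < count; ++i) {
        out[i] = kTaps[0] * in[i - 1] + kTaps[1] * in[i] + kTaps[2] * in[i + 1];
    }
}
```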

Upvotes: 4
