gorill

Reputation: 1673

Improve OpenMP/SSE parallelization effect

I've tried to improve the performance of a routine using OpenMP (parallel for) and SSE intrinsics:

#include <smmintrin.h> //for _mm_loadu_ps/_mm_dp_ps/_mm_store_ss (SSE4.1)

void Tester::ProcessParallel()//ProcessParallel is member of Tester class
{
    //Initialize
    auto OutMapLen      = this->_OutMapLen;
    auto KernelBatchLen = this->_KernelBatchLen;
    auto OutMapHeig     = this->_OutMapHeig;
    auto OutMapWid      = this->_OutMapWid;
    auto InpMapWid      = this->_InpMapWid;
    auto NumInputMaps   = this->_NumInputMaps;
    auto InpMapLen      = this->_InpMapLen;
    auto KernelLen      = this->_KernelLen;
    auto KernelHeig     = this->_KernelHeig;
    auto KernelWid      = this->_KernelWid;
    auto input_local    = this->input;
    auto output_local   = this->output;
    auto weights_local  = this->weights;
    auto biases_local   = this->biases;
    auto klim           = this->_klim;

    #pragma omp parallel for firstprivate(OutMapLen,KernelBatchLen,OutMapHeig,OutMapWid,InpMapWid,NumInputMaps,InpMapLen,KernelLen,KernelHeig,KernelWid,input_local,output_local,weights_local,biases_local,klim)
    for(auto i=0; i<_NumOutMaps; ++i)
    {   
        auto output_map   = output_local  + i*OutMapLen;
        auto kernel_batch = weights_local + i*KernelBatchLen;
        auto bias = biases_local + i;
        for(auto j=0; j<OutMapHeig; ++j)
        {
            auto output_map_row = output_map + j*OutMapWid;
            auto inp_row_idx = j*InpMapWid;
            for(auto k=0; k<OutMapWid; ++k)
            {
                auto output_nn = output_map_row + k;
                *output_nn     = *bias;
                auto inp_cursor_idx = inp_row_idx + k;
                for(int _i=0; _i<NumInputMaps; ++_i)
                {
                    auto input_cursor = input_local + _i*InpMapLen + inp_cursor_idx;
                    auto kernel = kernel_batch + _i*KernelLen;
                    for(int _j=0; _j<KernelHeig; ++_j)
                    {
                        auto kernel_row_idx  = _j*KernelWid;
                        auto inp_row_cur_idx = _j*InpMapWid;
                        int _k=0;
                        for(; _k<klim; _k+=4)//unroll and vectorize
                        {
                            float buf;
                            __m128 wgt = _mm_loadu_ps(kernel+kernel_row_idx+_k);
                            __m128 inp = _mm_loadu_ps(input_cursor+inp_row_cur_idx+_k);
                            __m128 prd = _mm_dp_ps(wgt, inp, 0xf1);
                            _mm_store_ss(&buf, prd);
                            *output_nn += buf;
                        }
                        for(; _k<KernelWid; ++_k)//residual loop
                            *output_nn += *(kernel+kernel_row_idx+_k) * *(input_cursor+inp_row_cur_idx+_k);
                    }
                }
            }
        }
    }
}

Pure unrolling and SSE vectorization of the innermost loop (without OpenMP) improves total performance by ~1.3x, which is a pretty nice result. However, pure OpenMP parallelization of the outer loop (without unrolling/vectorization) gives only a ~2.1x gain on a 4-core/8-thread processor (Core i7 2600K). Combining SSE vectorization with the OpenMP parallel for gives a 2.3-2.7x gain in total. How can I boost the effect of the OpenMP parallelization in the code above?

Interesting: if I replace the klim variable - the bound of the unrolled innermost loop - with a scalar constant, say 4, the total performance gain rises to ~3.5x.
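
For reference, this is the only change - the unrolled loop with klim replaced by the literal constant (it computes the same result only if klim actually equals 4 for my kernel sizes; otherwise the residual loop picks up the remaining columns):

for(; _k<4; _k+=4)//constant trip count instead of klim
{
    float buf;
    __m128 wgt = _mm_loadu_ps(kernel+kernel_row_idx+_k);
    __m128 inp = _mm_loadu_ps(input_cursor+inp_row_cur_idx+_k);
    __m128 prd = _mm_dp_ps(wgt, inp, 0xf1);
    _mm_store_ss(&buf, prd);
    *output_nn += buf;
}
for(; _k<KernelWid; ++_k)//residual loop unchanged
    *output_nn += *(kernel+kernel_row_idx+_k) * *(input_cursor+inp_row_cur_idx+_k);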

Upvotes: 2

Views: 814

Answers (1)

Hristo Iliev

Reputation: 74395

Vectorisation and threading do not work orthogonally (with respect to speeding up the calculation) in most cases, i.e. their speed-ups do not necessarily add up. What's worse is that this happens mostly in cases like yours, where data is processed in a streaming fashion. The reason for that is simple: finite memory bandwidth. A very simple measure of whether this is the case is the so-called computational intensity (CI), defined as the amount of data processing (usually in FLOPS) performed per byte of input data.

In your case you load two XMM registers, i.e. 32 bytes of data in total, and then perform one dot-product operation. Let's have your code running on a 2 GHz Sandy Bridge CPU. Although DPPS takes a full 12 cycles to complete on SNB, the CPU is able to overlap several such instructions and retire one every 2 cycles. Therefore at 2 GHz each core could perform 1 billion dot products per second in a tight loop, and it would require 32 GB/s of memory bandwidth to keep such a loop busy. The actual bandwidth needed in your case is lower since there are other instructions in the loop, but the main idea remains: the processing rate of the loop is limited by the amount of data that the memory is able to feed to the core.

As long as all the data fits into the last-level cache (LLC), performance more or less scales with the number of threads, since the LLC usually provides fairly high bandwidth (e.g. 300 GB/s on Xeon 7500's as stated here). This is no longer the case once the data grows too big to fit into the cache, because main memory usually provides an order of magnitude less bandwidth per memory controller. In that case all cores have to share the limited memory bandwidth, and once it is saturated, adding more threads does not increase the speed-up. Only adding more bandwidth, e.g. a system with several CPU sockets, results in increased processing speed.
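
Written out, the back-of-the-envelope estimate above is

$$\mathrm{BW}_{\text{needed}} \approx 32\ \text{bytes} \times \frac{2\times 10^{9}\ \text{cycles/s}}{2\ \text{cycles per DPPS}} = 32\ \text{GB/s per core},$$

which is already more than a typical single desktop memory controller can sustain, let alone when several cores run the loop simultaneously.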

There is a theoretical model, called the Roofline model, that captures this in a more formal way. You can see some explanations and applications of the model in this presentation.
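
In its simplest form the model bounds the attainable performance as

$$P_{\text{attainable}} = \min\bigl(P_{\text{peak}},\ I \times B_{\text{mem}}\bigr),$$

where $I$ is the computational intensity in FLOPS/byte and $B_{\text{mem}}$ is the sustained memory bandwidth. Low-$I$ loops like yours sit on the bandwidth-limited slope of the roofline, where neither more cores nor wider SIMD helps.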

The bottom line is: both vectorisation and multiprocessing (e.g. threading) increase the performance but also increase the memory pressure. As long as the memory bandwidth is not saturated, both result in increased processing rate. Once the memory becomes the bottleneck, performance does not increase any more. There are even cases when multithreaded performance drops because of the additional pressure put by vectorisation.

One possible optimisation hint: the store to *output_nn might not get optimised, since output_nn ultimately points into a shared array and the accumulator therefore cannot be kept in a register. So you might try something like:

for(auto k=0; k<OutMapWid; ++k)
{
    auto output_nn = output_map_row + k;
    auto _output_nn = *bias;
    auto inp_cursor_idx = inp_row_idx + k;
    for(int _i=0; _i<NumInputMaps; ++_i)
    {
        ...
        for(int _j=0; _j<KernelHeig; ++_j)
        {
            ...
            for(; _k<klim; _k+=4)//unroll and vectorize
            {
                ...
                _output_nn += buf;
            }
            for(; _k<KernelWid; ++_k)//residual loop
                _output_nn += *(kernel+kernel_row_idx+_k) * *(input_cursor+inp_row_cur_idx+_k);
        }
    }
    *output_nn = _output_nn;
}

But I guess your compiler is smart enough to figure it out by itself. Anyway, this would only matter in the single-threaded case. Once you are in the saturated-memory-bandwidth region, no such optimisations would matter.

Upvotes: 1
