ErmIg

Reputation: 4038

HOG optimization using SIMD

There have been several attempts to optimize the calculation of the HOG descriptor using SIMD instructions: OpenCV, Dlib, and Simd. All of them use scalar code to add the resulting magnitude to the HOG histogram:

float histogram[height/8][width/8][18];
float ky[height], kx[width];
int idx[size];
float val[size]; 

for(size_t i = 0; i < size; ++i)
{
    histogram[y/8][x/8][idx[i]] += val[i]*ky[y]*kx[x];
    histogram[y/8][x/8 + 1][idx[i]] += val[i]*ky[y]*kx[x + 1];
    histogram[y/8 + 1][x/8][idx[i]] += val[i]*ky[y + 1]*kx[x];
    histogram[y/8 + 1][x/8 + 1][idx[i]] += val[i]*ky[y + 1]*kx[x + 1];
}

Here the value of size depends on the implementation, but the general idea is the same.

I know that histogram calculation using SIMD has no simple and effective solution in general. But in this case the histogram is small (18 bins). Can that help with SIMD optimization?

Upvotes: 1

Views: 425

Answers (2)

ErmIg

Reputation: 4038

I have found a solution: a temporary buffer. First we accumulate the histogram into a temporary buffer (this operation can be vectorized). Then we add the sums from the buffer to the output histogram (this operation can also be vectorized):

float histogram[height/8][width/8][18];
float ky[height], kx[width];
int idx[size];
float val[size]; 
float buf[18][4] = {{0}}; // temporary accumulator, must start zeroed

for(size_t i = 0; i < size; ++i)
{
    buf[idx[i]][0] += val[i]*ky[y]*kx[x];
    buf[idx[i]][1] += val[i]*ky[y]*kx[x + 1];
    buf[idx[i]][2] += val[i]*ky[y + 1]*kx[x];
    buf[idx[i]][3] += val[i]*ky[y + 1]*kx[x + 1];
}

for(size_t i = 0; i < 18; ++i)
{
    histogram[y/8][x/8][i] += buf[i][0];
    histogram[y/8][x/8 + 1][i] += buf[i][1];
    histogram[y/8 + 1][x/8][i] += buf[i][2];
    histogram[y/8 + 1][x/8 + 1][i] += buf[i][3];
}
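
To make the vectorization of the first loop concrete, here is a minimal sketch, assuming SSE is available (with a scalar fallback otherwise). The four floats of buf[bin] are contiguous, so one pixel's four contributions fit a single 128-bit add. The names accumulate_pixel and flush_buffer are hypothetical helpers for illustration, not taken from OpenCV, Dlib, or Simd. Note that the weights are multiplied together before val, so rounding may differ slightly from the scalar code above.

```c
#include <assert.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

enum { BINS = 18 };  /* orientation bins per cell, as in the question */

/* Add one pixel's four weighted contributions to buf[bin][0..3].
   Lane order matches the answer: (y,x), (y,x+1), (y+1,x), (y+1,x+1). */
static void accumulate_pixel(float buf[BINS][4], int bin, float val,
                             float ky0, float ky1, float kx0, float kx1)
{
    assert(bin >= 0 && bin < BINS);
#if defined(__SSE__)
    /* _mm_set_ps lists lanes from highest (3) to lowest (0). */
    __m128 w = _mm_mul_ps(_mm_set_ps(ky1, ky1, ky0, ky0),
                          _mm_set_ps(kx1, kx0, kx1, kx0));
    __m128 acc = _mm_loadu_ps(buf[bin]);
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(val), w));
    _mm_storeu_ps(buf[bin], acc);
#else
    /* Scalar fallback with the same lane order. */
    buf[bin][0] += val * (ky0 * kx0);
    buf[bin][1] += val * (ky0 * kx1);
    buf[bin][2] += val * (ky1 * kx0);
    buf[bin][3] += val * (ky1 * kx1);
#endif
}

/* Add the buffer to the four destination cells, then clear it.
   h00..h11 each point at the 18 bins of one histogram cell. */
static void flush_buffer(float buf[BINS][4],
                         float *h00, float *h01, float *h10, float *h11)
{
    for (int b = 0; b < BINS; ++b) {
        h00[b] += buf[b][0];
        h01[b] += buf[b][1];
        h10[b] += buf[b][2];
        h11[b] += buf[b][3];
        buf[b][0] = buf[b][1] = buf[b][2] = buf[b][3] = 0.0f;
    }
}
```

With this layout the flush reads buf with a stride of 4, so it is kept scalar here; storing the buffer as [4][18] instead would make the flush contiguous at the cost of a scattered first loop.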

Upvotes: 1

Paul R

Reputation: 213060

You can do a partial optimisation by using SIMD to calculate all the (flattened) histogram indices and the bin increments. Then process these in a scalar loop afterwards. You probably also want to strip-mine this such that you process one row at a time, in order to keep the temporary bin indices and increments in cache. It might appear that this would be inefficient, due to the use of temporary intermediate buffers, but in practice I have seen a useful overall gain in similar scenarios.

uint32_t i = 0;

for (y = 0; y < height; ++y)   // for each row
{
    uint32_t indices[width * 4];  // flattened histogram indices for this row
    float vals[width * 4];     // histogram bin increments for this row

    // SIMD loop for this row - calculate flattened histogram indices and bin
    // increments (scalar code shown for reference - converting this loop to
    // SIMD is left as an exercise for the reader...)

    for (x = 0; x < width; ++x, ++i)
    {
        indices[4*x]   = (y/8)*(width/8)*18+(x/8)*18+idx[i];
        indices[4*x+1] = (y/8)*(width/8)*18+(x/8 + 1)*18+idx[i];
        indices[4*x+2] = (y/8+1)*(width/8)*18+(x/8)*18+idx[i];
        indices[4*x+3] = (y/8+1)*(width/8)*18+(x/8 + 1)*18+idx[i];

        vals[4*x]   = val[i]*ky[y]*kx[x];
        vals[4*x+1] = val[i]*ky[y]*kx[x+1];
        vals[4*x+2] = val[i]*ky[y+1]*kx[x];
        vals[4*x+3] = val[i]*ky[y+1]*kx[x+1];
    }

    // scalar loop for this row

    float * const histogram_base = &histogram[0][0][0]; // pointer to flattened histogram

    for (x = 0; x < width * 4; ++x) // for each index/increment pair in this row (4 per pixel)
    {
        histogram_base[indices[x]] += vals[x];  // update the (flattened) histogram
    }

}
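
For anyone who wants to try the two-pass idea, here is a self-contained scalar sketch of one row, under a few assumptions that are not in the code above: the histogram has one extra cell in each direction (so the +1 neighbours stay in bounds), ky and kx have one extra element, idx and val point at the start of the current row, and the scratch arrays are supplied by the caller. The function name hog_row_two_pass and the cells_x parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CELL = 8, BINS = 18 };  /* cell size and bin count from the question */

/* Process one image row y in two passes.
   hist  : flattened histogram, cells_x cells per histogram row, BINS floats per cell
   idx   : bin index per pixel of this row (width entries, each in [0, BINS))
   val   : magnitude per pixel of this row (width entries)
   ky,kx : weights; ky[y+1] and kx[width] must be readable
   inds, vals : scratch buffers with 4 * width entries each */
static void hog_row_two_pass(float *hist, size_t cells_x, size_t width, size_t y,
                             const int *idx, const float *val,
                             const float *ky, const float *kx,
                             uint32_t *inds, float *vals)
{
    assert(hist && idx && val && ky && kx && inds && vals);
    const size_t row_stride = cells_x * BINS;  /* floats per histogram row */

    /* Pass 1: pure arithmetic, no data-dependent stores into hist,
       so this is the loop that is a candidate for SIMD. */
    for (size_t x = 0; x < width; ++x) {
        const uint32_t base = (uint32_t)((y / CELL) * row_stride
                                         + (x / CELL) * BINS + (size_t)idx[x]);
        inds[4 * x]     = base;                                   /* (y, x)     */
        inds[4 * x + 1] = base + BINS;                            /* (y, x+1)   */
        inds[4 * x + 2] = base + (uint32_t)row_stride;            /* (y+1, x)   */
        inds[4 * x + 3] = base + (uint32_t)row_stride + BINS;     /* (y+1, x+1) */

        vals[4 * x]     = val[x] * ky[y]     * kx[x];
        vals[4 * x + 1] = val[x] * ky[y]     * kx[x + 1];
        vals[4 * x + 2] = val[x] * ky[y + 1] * kx[x];
        vals[4 * x + 3] = val[x] * ky[y + 1] * kx[x + 1];
    }

    /* Pass 2: scalar scatter-add; different pixels may hit the same bin. */
    for (size_t k = 0; k < 4 * width; ++k)
        hist[inds[k]] += vals[k];
}
```

Calling this once per row, reusing the same scratch buffers, keeps the intermediate indices and increments small enough to stay in cache, which is the point of processing one row at a time.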

Upvotes: 0
