Grayscale bilinear patch extraction - SSE optimization

My program makes intensive use of small sub-images extracted using bilinear interpolation from larger grayscale images.

I am using the following function for this purpose:

bool extract_patch_bilin(const cv::Point2f &patch_ctr, const cv::Mat_<uchar> &img, cv::Mat_<uchar> &patch)
{
    const int hsize = patch.rows/2;

    // ...
    // Precondition checks: patch is a preallocated square matrix and both patch and image have continuous buffers
    // ...

    int floorx=(int)floor(patch_ctr.x)-hsize, floory=(int)floor(patch_ctr.y)-hsize;
    if(floorx<0 || img.cols-1<floorx+patch.cols || floory<0 || img.rows-1<floory+patch.rows)
        return false;

    float x=patch_ctr.x-hsize-floorx;
    float y=patch_ctr.y-hsize-floory;
    float xy = x*y;
    float w00=1-x-y+xy, w01=x-xy, w10=y-xy, w11=xy;
    int img_stride = img.cols-patch.cols;
    uchar* buff_img0 = (uchar*)img.data+img.cols*floory+floorx;
    uchar* buff_img1 = buff_img0+img.cols;
    uchar* buff_patch = (uchar*)patch.data;
    for(int v=0; v<patch.rows; ++v,buff_img0+=img_stride,buff_img1+=img_stride) {
        for(int u=0; u<patch.cols; ++u,++buff_patch,++buff_img0,++buff_img1)
            buff_patch[0] = cv::saturate_cast<uchar>(buff_img0[0]*w00+buff_img0[1]*w01+buff_img1[0]*w10+buff_img1[1]*w11);
    }
    return true;
}

Long story short, I am already using parallelization in other parts of the program, and I am considering using SSE to optimize the execution of this function, because I am mostly using 8x8 patches and it seems like a good idea to process bunches of 8 pixels at a time using SSE.

However, I am not sure how to deal with the multiplication by the float interpolation weights (i.e. w00, w01, w10 and w11). These weights are necessarily non-negative and at most 1, and they sum to 1, hence the weighted sum cannot overflow the unsigned char datatype.

Does anyone know how to proceed?

EDIT:

I tried to do this as follows (assuming 16x16 patches), but there is no significant speed-up:

bool extract_patch_bilin_16x16(const cv::Point2f& patch_ctr, const cv::Mat_<uchar> &img, cv::Mat_<uchar> &patch)
{
    // ...
    // Precondition checks
    // ...

    const int hsize = patch.rows/2;
    int floorx=(int)floor(patch_ctr.x)-hsize, floory=(int)floor(patch_ctr.y)-hsize;
    // Check that the full extracted patch is inside the image
    if(floorx<0 || img.cols-1<floorx+patch.cols || floory<0 || img.rows-1<floory+patch.rows)
        return false;

    // Compute the constant bilinear weights
    float x=patch_ctr.x-hsize-floorx;
    float y=patch_ctr.y-hsize-floory;
    float xy = x*y;
    float w00=1-x-y+xy, w01=x-xy, w10=y-xy, w11=xy;
    // Prepare image resampling loop
    int img_stride = img.cols-patch.cols;
    uchar* buff_img0 = (uchar*)img.data+img.cols*floory+floorx;
    uchar* buff_img1 = buff_img0+img.cols;
    uchar* buff_patch = (uchar*)patch.data;
    // Precompute weighting variables
    const __m128i CONST_0 = _mm_setzero_si128();
    __m128i w00x256_32i = _mm_set1_epi32(cvRound(w00*256));
    __m128i w01x256_32i = _mm_set1_epi32(cvRound(w01*256));
    __m128i w10x256_32i = _mm_set1_epi32(cvRound(w10*256));
    __m128i w11x256_32i = _mm_set1_epi32(cvRound(w11*256));
    __m128i w00x256_16i = _mm_packs_epi32(w00x256_32i,w00x256_32i);
    __m128i w01x256_16i = _mm_packs_epi32(w01x256_32i,w01x256_32i);
    __m128i w10x256_16i = _mm_packs_epi32(w10x256_32i,w10x256_32i);
    __m128i w11x256_16i = _mm_packs_epi32(w11x256_32i,w11x256_32i);
    // Process pixels
    int ngroups = patch.cols>>4; // number of 16-pixel groups per patch row
    for(int v=0; v<patch.rows; ++v,buff_img0+=img_stride,buff_img1+=img_stride) {
        for(int g=0; g<ngroups; ++g,buff_patch+=16,buff_img0+=16,buff_img1+=16) {
                ////////////////////////////////
                // Load the data (16 pixels in one load)
                ////////////////////////////////
                __m128i val00 = _mm_loadu_si128((__m128i*)buff_img0);
                __m128i val01 = _mm_loadu_si128((__m128i*)(buff_img0+1));
                __m128i val10 = _mm_loadu_si128((__m128i*)buff_img1);
                __m128i val11 = _mm_loadu_si128((__m128i*)(buff_img1+1));
                ////////////////////////////////
                // Process the lower 8 values
                ////////////////////////////////
                // Unpack into 16-bits integers
                __m128i val00_lo = _mm_unpacklo_epi8(val00,CONST_0);
                __m128i val01_lo = _mm_unpacklo_epi8(val01,CONST_0);
                __m128i val10_lo = _mm_unpacklo_epi8(val10,CONST_0);
                __m128i val11_lo = _mm_unpacklo_epi8(val11,CONST_0);
                // Multiply with the integer weights
                __m128i w256val00_lo = _mm_mullo_epi16(val00_lo,w00x256_16i);
                __m128i w256val01_lo = _mm_mullo_epi16(val01_lo,w01x256_16i);
                __m128i w256val10_lo = _mm_mullo_epi16(val10_lo,w10x256_16i);
                __m128i w256val11_lo = _mm_mullo_epi16(val11_lo,w11x256_16i);
                // Divide by 256 to get the approximate result of the multiplication with floating-point weights
                __m128i wval00_lo = _mm_srli_epi16(w256val00_lo,8);
                __m128i wval01_lo = _mm_srli_epi16(w256val01_lo,8);
                __m128i wval10_lo = _mm_srli_epi16(w256val10_lo,8);
                __m128i wval11_lo = _mm_srli_epi16(w256val11_lo,8);
                // Add pairwise
                __m128i sum0_lo = _mm_add_epi16(wval00_lo,wval01_lo);
                __m128i sum1_lo = _mm_add_epi16(wval10_lo,wval11_lo);
                __m128i final_lo = _mm_add_epi16(sum0_lo,sum1_lo);
                ////////////////////////////////
                // Process the higher 8 values
                ////////////////////////////////
                // Unpack into 16-bits integers
                __m128i val00_hi = _mm_unpackhi_epi8(val00,CONST_0);
                __m128i val01_hi = _mm_unpackhi_epi8(val01,CONST_0);
                __m128i val10_hi = _mm_unpackhi_epi8(val10,CONST_0);
                __m128i val11_hi = _mm_unpackhi_epi8(val11,CONST_0);
                // Multiply with the integer weights
                __m128i w256val00_hi = _mm_mullo_epi16(val00_hi,w00x256_16i);
                __m128i w256val01_hi = _mm_mullo_epi16(val01_hi,w01x256_16i);
                __m128i w256val10_hi = _mm_mullo_epi16(val10_hi,w10x256_16i);
                __m128i w256val11_hi = _mm_mullo_epi16(val11_hi,w11x256_16i);
                // Divide by 256 to get the approximate result of the multiplication with floating-point weights
                __m128i wval00_hi = _mm_srli_epi16(w256val00_hi,8);
                __m128i wval01_hi = _mm_srli_epi16(w256val01_hi,8);
                __m128i wval10_hi = _mm_srli_epi16(w256val10_hi,8);
                __m128i wval11_hi = _mm_srli_epi16(w256val11_hi,8);
                // Add pairwise
                __m128i sum0_hi = _mm_add_epi16(wval00_hi,wval01_hi);
                __m128i sum1_hi = _mm_add_epi16(wval10_hi,wval11_hi);
                __m128i final_hi = _mm_add_epi16(sum0_hi,sum1_hi);
                ////////////////////////////////
                // Repack all values
                ////////////////////////////////
                __m128i final_val = _mm_packus_epi16(final_lo,final_hi);
                _mm_storeu_si128((__m128i*)buff_patch,final_val);
        }
    }
    return true;
}

Any idea what could be done to get a better speed-up?

Upvotes: 0

Answers (1)

user1196549

I would consider sticking to integers: your weights are multiples of 1/64, so 8.6 fixed-point arithmetic is enough, and that fits in 16-bit numbers.
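As a rough scalar sketch of the 8.6 fixed-point idea: the helper names `Weights6`, `quantize_weights` and `interp_8p6` are hypothetical, and the sketch assumes the fractional offsets are rounded to eighths of a pixel, so that the four weights are multiples of 1/64 and always sum to 64. The rounding term before the shift is an addition of this sketch.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for the four weights, expressed in 64ths.
struct Weights6 { int w00, w01, w10, w11; };

// Quantizes the fractional offsets x, y in [0, 1) to eighths of a pixel.
// The products are then multiples of 1/64 and sum to exactly 64.
inline Weights6 quantize_weights(float x, float y) {
    assert(x >= 0.0f && x < 1.0f && y >= 0.0f && y < 1.0f);
    const int xi = static_cast<int>(std::lround(x * 8.0f));  // 0..8
    const int yi = static_cast<int>(std::lround(y * 8.0f));  // 0..8
    return { (8 - xi) * (8 - yi), xi * (8 - yi), (8 - xi) * yi, xi * yi };
}

// Weighted sum of four 8-bit pixels with 6 fractional bits.
// The accumulator is at most 255 * 64 = 16320, so it fits in 16 bits.
inline unsigned char interp_8p6(const Weights6& w, int p00, int p01, int p10, int p11) {
    const int acc = p00 * w.w00 + p01 * w.w01 + p10 * w.w10 + p11 * w.w11;
    return static_cast<unsigned char>((acc + 32) >> 6);  // round, then divide by 64
}
```

For example, `quantize_weights(0.25f, 0.5f)` gives the weights 24, 8, 24 and 8, which sum to 64.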

Bilinear interpolation is best done as three linear ones (two on Y then one on X; you can reuse the second Y interpolation for the neighboring patch).
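A small scalar sketch may make the decomposition concrete. The function names below are illustrative only; `bilinear_four_terms` uses the same weight expressions as the code in the question, and `bilinear_separable` computes the same value with two interpolations along Y followed by one along X.

```cpp
#include <cassert>
#include <cmath>

// Linear interpolation between a and b with fraction t in [0, 1].
inline float lerp1(float a, float b, float t) {
    assert(t >= 0.0f && t <= 1.0f);
    return a + t * (b - a);
}

// Bilinear interpolation as three linear ones.
// p00/p01 are the top-left/top-right pixels, p10/p11 the bottom-left/bottom-right ones.
inline float bilinear_separable(float p00, float p01, float p10, float p11,
                                float x, float y) {
    const float left  = lerp1(p00, p10, y);  // interpolate down the left column
    const float right = lerp1(p01, p11, y);  // right column; can serve as `left` of the next quad
    return lerp1(left, right, x);            // interpolate across
}

// Four-term form with the weights w00, w01, w10, w11 from the question.
inline float bilinear_four_terms(float p00, float p01, float p10, float p11,
                                 float x, float y) {
    const float xy = x * y;
    return p00 * (1 - x - y + xy) + p01 * (x - xy) + p10 * (y - xy) + p11 * xy;
}
```

Both forms agree up to floating-point rounding; the separable one needs three multiplications per pixel instead of four, and the right-hand column result can be carried over to the next quad.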

To perform a linear interpolation between two values, precompute the interpolation weights P and Q once and for all (8 down to 1 and 0 up to 7), then multiply and add them in pairs as V0*P[i] + V1*Q[i]. This is done efficiently with the PMADDUBSW instruction, after appropriate data interleaving and replication of the values V0 and V1 (with PUNPCKLBW and the like).

In the end, divide by the total weight (PSRLW), rescale to bytes (PACKUSWB). (This step can be performed once only, combining the two interpolations.)
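The steps above could look roughly like the following sketch for a single pair of values. It assumes an x86 CPU with SSSE3 (PMADDUBSW is exposed as `_mm_maddubs_epi16`); the function name `lerp8_maddubs` and the `TARGET_SSSE3` macro are hypothetical helpers, the latter only there so that GCC and Clang accept the intrinsic without a global `-mssse3` flag.

```cpp
#include <cassert>
#include <cstdint>
#include <tmmintrin.h>  // SSSE3: _mm_maddubs_epi16

#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSSE3 __attribute__((target("ssse3")))
#else
#define TARGET_SSSE3
#endif

// Writes out[i] = (v0 * (8 - i) + v1 * i) >> 3 for i = 0..7.
TARGET_SSSE3
inline void lerp8_maddubs(std::uint8_t v0, std::uint8_t v1, std::uint8_t* out) {
    assert(out != nullptr);
    // Interleaved weight pairs (P[i], Q[i]) with P = 8..1 and Q = 0..7, stored once.
    const __m128i weights = _mm_setr_epi8(8, 0, 7, 1, 6, 2, 5, 3, 4, 4, 3, 5, 2, 6, 1, 7);
    // Replicate the byte pair (v0, v1) into all eight 16-bit lanes.
    const __m128i pairs = _mm_set1_epi16(static_cast<short>(v0 | (v1 << 8)));
    // PMADDUBSW: unsigned pixels times signed weights, adjacent products summed.
    // Each sum is at most 255 * 8 = 2040, so the signed saturation never triggers.
    const __m128i sums = _mm_maddubs_epi16(pairs, weights);
    // PSRLW: divide by the total weight (8).
    const __m128i scaled = _mm_srli_epi16(sums, 3);
    // PACKUSWB: back to bytes; only the low 8 bytes are stored.
    _mm_storel_epi64(reinterpret_cast<__m128i*>(out), _mm_packus_epi16(scaled, scaled));
}
```

Calling `lerp8_maddubs(0, 80, out)` fills `out` with 0, 10, 20, ..., 70. In a real resampling loop the replication would come from unpack or shuffle instructions on loaded pixels rather than from scalar arguments.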

You could think of doubling all weights, so that the final scaling is by 8 bits, and PACKUSWB would suffice, but unfortunately it saturates the values and there is no unsaturated equivalent.

It could be that precomputing all 64 interpolation weights and summing the four bilinear terms is better.

UPDATE:

If the goal is to interpolate with fixed coefficients for all pixel quads (actually achieving subpixel translation), the strategy is different.

You will load a run of 8 (or 16?) pixels corresponding to the upper-left corners, a run of 8 shifted one pixel to the right (corresponding to the upper-right corners), and similarly for the next row (bottom corners); multiply and add in pairs (PMADDUBSW) the pixel values with the corresponding interpolation weights, and combine the pairs (PADDW). Store the weights with replication.

Another option is to avoid PMADDUBSW and perform separate multiplies (PMULLW) and adds (PADDW). This simplifies the data reorganization.

After scaling (as above), you end up with a run of 8 interpolated values.

This can work as well for variable interpolation weights, as long as you interpolate exactly one pixel per quad.
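Here is a hedged sketch of the PMADDUBSW variant for one run of 8 output pixels. It assumes SSSE3, integer weights already scaled so that they sum to 64 (which keeps them within the signed-byte range that PMADDUBSW requires), and at least 9 readable bytes in each source row. The names `bilinear8_fixed` and `TARGET_SSSE3` are hypothetical, and the rounding term before the shift is an addition of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <tmmintrin.h>  // SSSE3: _mm_maddubs_epi16

#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSSE3 __attribute__((target("ssse3")))
#else
#define TARGET_SSSE3
#endif

// Interpolates 8 pixels from two consecutive image rows with fixed weights.
// row0 and row1 must each provide at least 9 readable bytes.
TARGET_SSSE3
inline void bilinear8_fixed(const std::uint8_t* row0, const std::uint8_t* row1,
                            int w00, int w01, int w10, int w11, std::uint8_t* out) {
    assert(w00 >= 0 && w01 >= 0 && w10 >= 0 && w11 >= 0);
    assert(w00 + w01 + w10 + w11 == 64);
    // Weights stored with replication: byte pairs (w00, w01) and (w10, w11).
    const __m128i wtop = _mm_set1_epi16(static_cast<short>(w00 | (w01 << 8)));
    const __m128i wbot = _mm_set1_epi16(static_cast<short>(w10 | (w11 << 8)));
    // Runs of 8 pixels: left corners, and the same runs shifted one pixel right.
    const __m128i tl = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(row0));
    const __m128i tr = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(row0 + 1));
    const __m128i bl = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(row1));
    const __m128i br = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(row1 + 1));
    // Interleave so that each 16-bit lane holds the (left, right) pair of one quad.
    const __m128i top = _mm_unpacklo_epi8(tl, tr);
    const __m128i bot = _mm_unpacklo_epi8(bl, br);
    // PMADDUBSW per row, PADDW to combine; the total is at most 255 * 64 = 16320.
    __m128i sum = _mm_add_epi16(_mm_maddubs_epi16(top, wtop),
                                _mm_maddubs_epi16(bot, wbot));
    // Round, divide by 64 (PSRLW), then pack to bytes (PACKUSWB).
    sum = _mm_srli_epi16(_mm_add_epi16(sum, _mm_set1_epi16(32)), 6);
    _mm_storel_epi64(reinterpret_cast<__m128i*>(out), _mm_packus_epi16(sum, sum));
}
```

With 6-bit weights the subpixel position is only resolved to a coarse grid, so whether the precision is acceptable depends on the application; the PMULLW variant allows 8-bit weights at the cost of more unpacking.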

Upvotes: 2
