Reputation: 1219

how do I optimize this loop further?

Well, this post is the reason for Why is memcpy() and memmove() faster than pointer increments?

I have a custom image structure with 4 channels and I like to extract 3 of them (RGB) into an openCV cv::MAt strcuture with 3 channels.

cv::Mat GetColorImgFromFrame(Image* pimg)
{
    int iHeight = pimg->m_imageInfo.m_height;
    int iWidth = pimg->m_imageInfo.m_width;

    cv::Mat cvmColorImg = cv::Mat::zeros(iHeight, iWidth, CV_8UC3);
    int iDestStep = cvmColorImg.channels(); // 3

    uchar* pucSrc = pimg->m_data;    
    uchar* pucDest = cvmColorImg.data;
    uchar* pucDestLimit = cvmColorImg.data + iHeight*iWidth*iDestStep;

    for (;pucDest < pucDestLimit;)
    {       
        *(pucDest++) = *(pucSrc++);
        *(pucDest++) = *(pucSrc++);
        *(pucDest++) = *(pucSrc++);
        pucSrc++;
    }

    return cvmColorImg;
}

It doesn't seem much helpful to replace the three inner assignments with memcpy(pucDest, pucSrc, 3). Any suggestions?

Upvotes: 1

Answers (3)

David Hammen

Reputation: 33136

Expanding on Doc Brown's answer, unfold the loop so that you are writing three uint32_t integers for every four read (uint64_t will take even bigger gulps but will be a bit harder to deal with).

Reading and writing ints may be problematic if your pimg->m_data and cvmColorImg.data do not start on a uint32_t boundary. You can force this to be the case by putting an int or unsigned int data member before these char* data members. Or you can use something akin to Duff's device. Personal preference: I would just force the alignment. It makes the copy code cleaner, easier, and perhaps faster. A few wasted bytes for padding is a tiny cost.

As is typically the case for unrolled loops, you'll have to do something special to handle the end. (Duff's device handles the oddity at the start, so no go with Duff's device here.) A lot of the problems with the end of loop processing will go away if you pad the end of the output buffer so that it occupies 4*N bytes.

Upvotes: 1

shodanex

Reputation: 15456

Before going into "assembler like" mode by using SIMD intrinsic, you can try to tell your compiler that the source and destination pointer do not alias, using for example the __restrict__ keyword with gcc or __restrict with visual studio

Upvotes: 3

Doc Brown

Reputation: 20054

On a 32 bit machine, you could try to copy 4 bytes at once per loop from source to destination using unsigned int *. The trick is to keep pucDest an uchar pointer incremented by 3. You have deal with the the last word as a special case, not to violate the array boundary of the destination array for the last byte.

Since copying an int is typically not slower than copying a char on most machines, I suspect this will give you roughly a speed increase of a factor 3.

Upvotes: 3

how do I optimize this loop further?

Answers (3)

Related Questions