Reputation: 1219
Well, this post is the reason for Why is memcpy() and memmove() faster than pointer increments?
I have a custom image structure with 4 channels and I like to extract 3 of them (RGB) into an openCV cv::MAt strcuture with 3 channels.
cv::Mat GetColorImgFromFrame(Image* pimg)
{
int iHeight = pimg->m_imageInfo.m_height;
int iWidth = pimg->m_imageInfo.m_width;
cv::Mat cvmColorImg = cv::Mat::zeros(iHeight, iWidth, CV_8UC3);
int iDestStep = cvmColorImg.channels(); // 3
uchar* pucSrc = pimg->m_data;
uchar* pucDest = cvmColorImg.data;
uchar* pucDestLimit = cvmColorImg.data + iHeight*iWidth*iDestStep;
for (;pucDest < pucDestLimit;)
{
*(pucDest++) = *(pucSrc++);
*(pucDest++) = *(pucSrc++);
*(pucDest++) = *(pucSrc++);
pucSrc++;
}
return cvmColorImg;
}
It doesn't seem much helpful to replace the three inner assignments with memcpy(pucDest, pucSrc, 3). Any suggestions?
Upvotes: 1
Views: 269
Reputation: 33116
Expanding on Doc Brown's answer, unfold the loop so that you are writing three uint32_t
integers for every four read (uint64_t
will take even bigger gulps but will be a bit harder to deal with).
Reading and writing ints may be problematic if your pimg->m_data
and cvmColorImg.data
do not start on a uint32_t
boundary. You can force this to be the case by putting an int
or unsigned int
data member before these char*
data members. Or you can use something akin to Duff's device. Personal preference: I would just force the alignment. It makes the copy code cleaner, easier, and perhaps faster. A few wasted bytes for padding is a tiny cost.
As is typically the case for unrolled loops, you'll have to do something special to handle the end. (Duff's device handles the oddity at the start, so no go with Duff's device here.) A lot of the problems with the end of loop processing will go away if you pad the end of the output buffer so that it occupies 4*N bytes.
Upvotes: 1
Reputation: 15406
Before going into "assembler like" mode by using SIMD intrinsic, you can try to tell your compiler that the source and destination pointer do not alias, using for example the __restrict__
keyword with gcc or __restrict
with visual studio
Upvotes: 3
Reputation: 20044
On a 32 bit machine, you could try to copy 4 bytes at once per loop from source to destination using unsigned int *
. The trick is to keep pucDest
an uchar
pointer incremented by 3. You have deal with the the last word as a special case, not to violate the array boundary of the destination array for the last byte.
Since copying an int is typically not slower than copying a char on most machines, I suspect this will give you roughly a speed increase of a factor 3.
Upvotes: 3