Reputation: 1219
I just started using SSE to optimize my code for a computer vision project that detects skin color in an image. Below is my function: it takes a color image, looks at each pixel, and returns a probability map. The commented-out code is my original C++ implementation and the rest is the SSE version. I timed both, and it's weird that the SSE version isn't any faster than my original C++ code. Any suggestions about what's going on, or how to optimize the function further?
void EvalSkinProb(const Mat& cvmColorImg, Mat& cvmProb)
{
    std::clock_t ts = std::clock();
    Mat cvmHSV = Mat::zeros(cvmColorImg.rows, cvmColorImg.cols, CV_8UC3);
    cvtColor(cvmColorImg, cvmHSV, CV_BGR2HSV);
    std::clock_t te1 = std::clock();

    float fFG, fBG;
    double dp;
    __declspec(align(16)) int frgb[4] = {0};
    __declspec(align(16)) int fBase[4] = {g_iLowHue, g_iLowSat, g_iLowVal, 0};
    __declspec(align(16)) int fIndx[4] = {0};

    __m128i* pSrc1 = (__m128i*) frgb;
    __m128i* pSrc2 = (__m128i*) fBase;
    __m128i* pDest = (__m128i*) fIndx;
    __m128i m1;

    for (int y = 0; y < cvmColorImg.rows; y++)
    {
        for (int x = 0; x < cvmColorImg.cols; x++)
        {
            cv::Vec3b hsv = cvmHSV.at<cv::Vec3b>(y, x);
            frgb[0] = hsv[0]; frgb[1] = hsv[1]; frgb[2] = hsv[2];
            m1 = _mm_sub_epi32(*pSrc1, *pSrc2);
            *pDest = _mm_srli_epi32(m1, g_iSValPerbinBit);
            // original C++ code:
            //fIndx[0] = ((hsv[0]-g_iLowHue)>>g_iSValPerbinBit);
            //fIndx[1] = ((hsv[1]-g_iLowSat)>>g_iSValPerbinBit);
            //fIndx[2] = ((hsv[2]-g_iLowVal)>>g_iSValPerbinBit);
            fFG = m_cvmSkinHist.at<float>(fIndx[0], fIndx[1], fIndx[2]);
            fBG = m_cvmBGHist.at<float>(fIndx[0], fIndx[1], fIndx[2]);
            dp = (double)fFG/(fBG+fFG);
            cvmProb.at<double>(y, x) = dp;
        }
    }
    std::clock_t te2 = std::clock();
    double dSecs1 = (double)(te1-ts)/(CLOCKS_PER_SEC);
    double dSecs2 = (double)(te2-te1)/(CLOCKS_PER_SEC);
}
Upvotes: 3
Views: 1952
Reputation: 5348
I'm not too familiar with OpenCV, but I suspect you are only going to get decent throughput if you ensure the data you're accessing is already aligned outside the loop, rather than loading it unaligned inside the loop.
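A minimal sketch of what that means in practice (the function and buffer names are made up for illustration, and whether OpenCV actually hands you 16-byte-aligned row data is something you'd have to verify on your build):

#include <emmintrin.h>
#include <cstdint>

void AlignmentExample()
{
    // Hypothetical 16-byte-aligned scratch buffer, purely for illustration
    // (alignas is the C++11 spelling; older MSVC uses __declspec(align(16))).
    alignas(16) int indices[4] = {0};

    // Requires the pointer to be 16-byte aligned; may fault if it isn't.
    __m128i a = _mm_load_si128(reinterpret_cast<const __m128i*>(indices));

    // Tolerates any alignment, but can be slower, especially on older CPUs.
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(indices));

    // For memory you don't control (e.g. a cv::Mat's data pointer), you can at least check:
    bool is16Aligned = (reinterpret_cast<std::uintptr_t>(indices) % 16) == 0;
    (void)a; (void)b; (void)is16Aligned;
}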
Upvotes: 1
Reputation: 51263
Not tested, but it should give you some ideas:
auto base = _mm_setr_epi32(g_iLowHue, g_iLowSat, g_iLowVal, 0); // setr keeps the same lane order as your fBase array
auto zero = _mm_setzero_si128();
for (int y = 0; y < cvmColorImg.rows; y++)
{
    for (int x = 0; x < cvmColorImg.cols; x++)
    {
        // Load 16 bytes starting at this pixel. Would be better if cvmHSV was aligned,
        // in which case _mm_load_si128 is faster.
        auto raw = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&cvmHSV.at<cv::Vec3b>(y, x)[0]));
        // Widen the low bytes (H, S, V, ...) to 32-bit lanes before the integer math.
        auto hsv = _mm_unpacklo_epi16(_mm_unpacklo_epi8(raw, zero), zero);
        auto m1 = _mm_sub_epi32(hsv, base);
        auto m2 = _mm_srli_epi32(m1, g_iSValPerbinBit);
        auto fFG = static_cast<double>(m_cvmSkinHist.at<float>(m2.m128i_i32[0], m2.m128i_i32[1], m2.m128i_i32[2]));
        auto fBG = static_cast<double>(m_cvmBGHist.at<float>(m2.m128i_i32[0], m2.m128i_i32[1], m2.m128i_i32[2]));
        cvmProb.at<double>(y, x) = fFG/(fBG+fFG);
    }
}
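One portability caveat: `.m128i_i32` is an MSVC-specific member of `__m128i`; with GCC or Clang you would pull the lanes out with `_mm_cvtsi128_si32` or `_mm_extract_epi32` (the latter needs SSE4.1) instead.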
Upvotes: 0
Reputation: 471379
The first problem here is that you're doing very little SSE work for a tremendous amount of data movement. You'll spend most of the time just packing and unpacking data into and out of the SSE registers for the sake of two instructions...
Secondly, there is a very subtle performance penalty that will occur in this code.
You are using a buffer to transfer data between variables and SSE registers. This is a BIG NO-NO.
The reason lies in the CPU's load/store unit. When you write data to a memory location and then immediately try to read it back with a different word size, store-to-load forwarding fails: the data usually has to be flushed all the way out to cache and re-read, which can incur a penalty of 20+ cycles.
This is because CPU load/store units simply aren't optimized for this kind of mixed-size access.
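As a rough sketch of one way to avoid that store/reload round trip, the inner loop below keeps the pixel in a register end to end and pulls the indices straight out of the register instead of going through an aligned array. This is not the asker's code: it assumes SSE4.1 (for _mm_cvtepu8_epi32 and _mm_extract_epi32) and reuses the question's globals, histograms, and surrounding function as-is.

// Sketch only -- assumes SSE4.1 and the globals/histograms from the question.
#include <smmintrin.h>

const __m128i base = _mm_setr_epi32(g_iLowHue, g_iLowSat, g_iLowVal, 0);

for (int y = 0; y < cvmColorImg.rows; y++)
{
    for (int x = 0; x < cvmColorImg.cols; x++)
    {
        const cv::Vec3b& hsv = cvmHSV.at<cv::Vec3b>(y, x);

        // Pack the three channel bytes into the low 32 bits of a register,
        // then zero-extend them to four 32-bit lanes -- no intermediate array.
        int packed = hsv[0] | (hsv[1] << 8) | (hsv[2] << 16);
        __m128i v = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(packed));

        __m128i idx = _mm_srli_epi32(_mm_sub_epi32(v, base), g_iSValPerbinBit);

        // Read each lane directly from the register instead of spilling to memory.
        int i0 = _mm_extract_epi32(idx, 0);
        int i1 = _mm_extract_epi32(idx, 1);
        int i2 = _mm_extract_epi32(idx, 2);

        float fFG = m_cvmSkinHist.at<float>(i0, i1, i2);
        float fBG = m_cvmBGHist.at<float>(i0, i1, i2);
        cvmProb.at<double>(y, x) = (double)fFG / (fBG + fFG);
    }
}

Note that this only removes the forwarding stall; the first point still stands, since two vector instructions per pixel won't amortize the data movement. The bigger wins would come from doing more of the per-pixel work across whole rows at a time.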
Upvotes: 8