Reputation: 1673
I am looking for a way to compare the upper halves of two __m128d variables, so I looked through https://software.intel.com/sites/landingpage/IntrinsicsGuide/ for relevant intrinsics.
But I can only find intrinsics that compare the lower halves of two variables, for example _mm_comieq_sd.
I wonder why there are no intrinsics for comparing the upper halves, and more importantly, how can I compare the upper halves of two __m128d variables?
Update:
The code looks like this:
j0 = jprev0;
j1 = jprev1;
t_0 = p_i_x - pj_x_0;
t_1 = p_i_x - pj_x_1;
r2_0 = t_0 * t_0;
r2_1 = t_1 * t_1;
t_0 = p_i_y - pj_y_0;
t_1 = p_i_y - pj_y_1;
r2_0 += t_0 * t_0;
r2_1 += t_1 * t_1;
t_0 = p_i_z - pj_z_0;
t_1 = p_i_z - pj_z_1;
r2_0 += t_0 * t_0;
r2_1 += t_1 * t_1;
#if NAMD_ComputeNonbonded_SortAtoms != 0 && ( 0 PAIR ( + 1 ) )
sortEntry0 = sortValues + g;
sortEntry1 = sortValues + g + 1;
jprev0 = sortEntry0->index;
jprev1 = sortEntry1->index;
#else
jprev0 = glist[g ];
jprev1 = glist[g+1];
#endif
pj_x_0 = p_1[jprev0].position.x;
pj_x_1 = p_1[jprev1].position.x;
pj_y_0 = p_1[jprev0].position.y;
pj_y_1 = p_1[jprev1].position.y;
pj_z_0 = p_1[jprev0].position.z;
pj_z_1 = p_1[jprev1].position.z;
// want to use sse to compare those
bool test0 = ( r2_0 < groupplcutoff2 );
bool test1 = ( r2_1 < groupplcutoff2 );
//removing ifs benefits on many architectures
//as the extra stores will only warm the cache up
goodglist [ hu ] = j0;
goodglist [ hu + test0 ] = j1;
hu += test0 + test1;
And I am trying to rewrite it with SSE.
Upvotes: 2
Views: 651
Reputation: 363942
You're asking how to compare upper halves after already having compared the lower halves.
The SIMD way to do compares is with a packed-compare intrinsic like __m128d _mm_cmplt_pd (__m128d a, __m128d b), which produces a mask as its output instead of setting flags. AVX has the improved vcmppd / vcmpps, which offer a wider choice of compare predicates that you pass as a third argument: _mm_cmp_pd (__m128d a, __m128d b, const int imm8).
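As a side note, with AVX that same less-than compare is written with an explicit predicate. A minimal sketch, assuming an AVX-enabled build (the wrapper name is just for illustration):
#include <immintrin.h>
// AVX predicate form of the SSE2 less-than compare.
// _CMP_LT_OQ is the ordered, non-signaling less-than predicate.
__m128d cmplt_pd_avx(__m128d a, __m128d b) {
    return _mm_cmp_pd(a, b, _CMP_LT_OQ);
}
For your loop, the SSE2 version looks something like this: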
const __m128d groupplcutoff2_vec = _mm_set1_pd(groupplcutoff2);
// assuming groupplcutoff2 is a scalar double; with SSE3 this should compile to a single movddup, like _mm_movedup_pd() would.
__m128d r2 = ...;
// bool test0 = ( r2_0 < groupplcutoff2 );
// bool test1 = ( r2_1 < groupplcutoff2 );
__m128d ltvec = _mm_cmplt_pd(r2, groupplcutoff2_vec);
int ltmask = _mm_movemask_pd(ltvec);
bool test0 = ltmask & 1;
// bool test1 = ltmask & 2;
// assuming j is double. I'm not sure from your code, it might be int.
// and you're right, doing both stores unconditionally is prob. fastest, if your code isn't heavy on stores.
// goodglist [ hu ] = j0;
_mm_store_sd (&goodglist[hu], j);
// goodglist [ hu + test0 ] = j1;
_mm_storeh_pd(&goodglist[hu + test0], j);
// don't try to use non-AVX _mm_maskmoveu_si128, it's like movnt. And doesn't do exactly what this needs, anyway, without shuffling j and ltvec.
// hu += test0 + test1;
hu += _popcnt32(ltmask); // Nehalem or later. Check the popcnt CPUID flag
The popcnt trick will work just as efficiently with AVX (4 doubles packed in a ymm register). Packed compare -> movemask -> bit manipulation is a useful pattern to keep in mind.
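For example, here is a minimal AVX sketch of the compare/count part, assuming the squared distances are already packed 4-wide; the helper name is made up for illustration:
#include <immintrin.h>
// Compare 4 squared distances against the cutoff at once,
// turn the per-lane mask into 4 bits, and count how many passed.
static inline int count_below_cutoff(__m256d r2, double cutoff) {
    __m256d cutoff_vec = _mm256_set1_pd(cutoff);
    __m256d lt = _mm256_cmp_pd(r2, cutoff_vec, _CMP_LT_OQ); // all-ones in each lane where r2 < cutoff
    int ltmask = _mm256_movemask_pd(lt);                    // one bit per double
    return _popcnt32(ltmask);                               // number of lanes that passed
}
Note that the store side of your loop (goodglist) needs more than two unconditional stores once there are 4 lanes, so this only covers the compare/count half.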
Upvotes: 2