Reputation: 921
I have a vector of int and I need to find elements with a specific value and replace them with another value (the same old and new value for every element).
For example: replace 4 with 8 in all elements.
I'm accessing the memory directly in a loop in C++, but it's still too slow for me.
Update:
I'm working with an OpenCV Mat object on x86:
for (int i = 0; i < labels.rows; ++i) {
    for (int j = 0; j < labels.cols; ++j) {
        int& label = labels.at<int>(i, j);
        if (label == oldValue) {
            label = newValue;
        }
    }
}
In release mode, the Mat::at() function just returns the element through pointer arithmetic (the CV_DbgAssert checks compile out):
template<typename _Tp> inline
_Tp& Mat::at(int i0, int i1)
{
    CV_DbgAssert(dims <= 2);
    CV_DbgAssert(data);
    CV_DbgAssert((unsigned)i0 < (unsigned)size.p[0]);
    CV_DbgAssert((unsigned)(i1 * DataType<_Tp>::channels) < (unsigned)(size.p[1] * channels()));
    CV_DbgAssert(CV_ELEM_SIZE1(traits::Depth<_Tp>::value) == elemSize1());
    return ((_Tp*)(data + step.p[0] * i0))[i1];
}
Upvotes: 1
Views: 1455
Reputation: 365707
The key to letting the compiler auto-vectorize is to always assign to the element, even if you assign it to itself. (The ternary operator is good here, see @nemequ's answer). This lets the compiler do a read / rewrite of unchanged values, so it can vectorize with a load + compare and blend + store.
The compiler can't invent writes to memory locations that the C++ source doesn't write to, because that could step on writes from another thread. It's not a data race for different threads to read/write adjacent array elements. If another function the compiler doesn't know about was also using a vector-load / blend / store loop with a different search/replace value, their stores would step on each other. So this vectorization strategy only works if the source writes all the elements. The compiler is free to optimize that away (e.g. if it doesn't vectorize).
Comments on the other answer point out the down-side of unconditionally storing: it dirties the cache even if the data doesn't change. If search hits are rare, it could be worth branching to skip the store and save memory bandwidth, especially if multiple threads will be running this over large blocks of memory. Including in multiple instances of the program running on the same machine, but especially in a shared-memory situation.
AVX introduced masked-store instructions which solve this problem. AVX2 vpmaskmovd and AVX1 vmaskmovps both have 32-bit granularity, so you can use them directly for int data. For narrower elements, you could compare+blend with byte or word granularity, then check for changes with dword granularity to generate a mask.
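A hedged sketch of that narrow-element trick for 16-bit data with AVX2 (the function and the old16/new16 broadcast vectors are my own names; remainder handling is omitted and n is assumed to be a multiple of 16):

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Sketch only: replace oldVal with newVal in 16-bit data.
void replace_epi16_avx2(int16_t* p, size_t n, int16_t oldVal, int16_t newVal) {
    __m256i old16 = _mm256_set1_epi16(oldVal);
    __m256i new16 = _mm256_set1_epi16(newVal);
    for (size_t i = 0; i < n; i += 16) {
        __m256i v       = _mm256_loadu_si256((const __m256i*)(p + i));
        __m256i hit     = _mm256_cmpeq_epi16(v, old16);        // -1 in words equal to oldVal
        __m256i repl    = _mm256_blendv_epi8(v, new16, hit);   // word-granularity blend
        __m256i same_dw = _mm256_cmpeq_epi32(v, repl);         // -1 in dwords that didn't change
        __m256i dirty   = _mm256_xor_si256(same_dw, _mm256_set1_epi32(-1)); // MSB set where changed
        _mm256_maskstore_epi32((int*)(p + i), dirty, repl);    // store only the changed dwords
    }
}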
I think the implementation of vpmaskmovd (in Skylake at least) really does avoid dirtying the cache line when the mask is all-0. According to Intel's optimization manual (section 11.9, CONDITIONAL SIMD PACKED LOADS AND STORES), for a masked store followed by any reload: if the mask is all 0, the loads do not depend on the masked store. So the store queue knows that an all-zero mask makes the store a no-op.
I haven't tested, but I expect it avoids dirtying the cache line in this case, at least on Skylake (including Skylake-client which doesn't support AVX512; but it does have the microarchitectural features that AVX512 needs, like efficient masked stores). Masked elements are even allowed to touch illegal addresses without faulting, and some CPUs can do that (at least for the all-zero-mask case) without trapping for a microcode assist. So that would mean they have a way to squash the store entirely.
So the asm you'd want the compiler to make (via intrinsics or auto-vectorization) is:
;; outside the loop: ymm4 = set1_epi32(4); ymm5 = set1_epi32(8);
vpcmpeqd ymm0, [rdi], ymm4 ; ymm0 = _mm256_cmpeq_epi32
vpmaskmovd [rdi], ymm0, ymm5 ; store 8 in elements where ymm0 is -1
add rdi, 32
I haven't benchmarked this to see if it's actually faster (or at least equal when memory bandwidth isn't a bottleneck, which would be an easier microbenchmark to design).
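In intrinsics, a minimal sketch of that loop could look like the following (my own illustration, assuming AVX2, int data, and a count that's a multiple of 8; remainder handling omitted):

#include <immintrin.h>
#include <cstddef>

// Sketch only: replace every element equal to oldVal with newVal using a conditional masked store.
void replace_epi32_avx2(int* p, size_t n, int oldVal, int newVal) {
    __m256i vold = _mm256_set1_epi32(oldVal);   // ymm4 in the asm above
    __m256i vnew = _mm256_set1_epi32(newVal);   // ymm5 in the asm above
    for (size_t i = 0; i < n; i += 8) {
        __m256i v   = _mm256_loadu_si256((const __m256i*)(p + i));
        __m256i hit = _mm256_cmpeq_epi32(v, vold);  // vpcmpeqd; the compiler may fold the load into it
        _mm256_maskstore_epi32(p + i, hit, vnew);   // vpmaskmovd: write newVal only where hit == -1
    }
}

Whether the compiler actually folds the load into vpcmpeqd as in the asm above is worth checking in the generated code.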
A vpmaskmovd store is only 3 uops on Skylake (p0 + store-address + store-data). It's 4 uops on Haswell.
According to Agner Fog's testing, a vmaskmovps store is 4 uops on Skylake. It's very strange that it doesn't match the integer instruction that behaves identically.
Using a conditional masked store means you don't need the original data, so it allows folding the load into the vpcmpeqd. A load + cmp+blend + store would need 1 + 2 + 1 instructions, and vpblendvb is 2 uops (so is vblendvps). So in theory masked stores are faster.
vpblendvb on Haswell can only run on port 5, so that would limit you to processing 32 bytes every other clock, instead of one vector per 1.25 clocks (with an infinite unroll). 32 bytes per 2 clocks is usually fine, but if your data is hot in L1D cache then it's a bottleneck.
With AVX512, you'd probably implement it the same way, but with AVX512BW you could use the same masked-store strategy at granularities smaller than 32-bit: compare into k1, and vmovdqu8 [mem]{k1}, zmm8.
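A hedged AVX-512BW sketch of that byte-granularity version (function and variable names are mine; assumes AVX-512BW is available and n is a multiple of 64):

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Sketch only: byte-granularity compare into a mask register, then a masked byte store.
void replace_epi8_avx512bw(uint8_t* p, size_t n, uint8_t oldVal, uint8_t newVal) {
    __m512i vold = _mm512_set1_epi8((char)oldVal);
    __m512i vnew = _mm512_set1_epi8((char)newVal);
    for (size_t i = 0; i < n; i += 64) {
        __m512i v   = _mm512_loadu_si512(p + i);
        __mmask64 k = _mm512_cmpeq_epi8_mask(v, vold);   // mask of bytes equal to oldVal
        _mm512_mask_storeu_epi8(p + i, k, vnew);         // vmovdqu8 [mem]{k1}, zmm of newVal
    }
}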
Without AVX: DO NOT USE SSE maskmovdqu; it's slow, and implicitly NT so it flushes the cache line, and all that. Use load + blend + store instead.
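For reference, a load + blend + store sketch with plain SSE2 (no SSE4.1 blendv needed; the blend is done with and/andnot/or; names are my own and n is assumed to be a multiple of 4):

#include <emmintrin.h>
#include <cstddef>

// Sketch only: SSE2 unconditional-store version; every vector is rewritten, so it dirties the cache.
void replace_epi32_sse2(int* p, size_t n, int oldVal, int newVal) {
    __m128i vold = _mm_set1_epi32(oldVal);
    __m128i vnew = _mm_set1_epi32(newVal);
    for (size_t i = 0; i < n; i += 4) {
        __m128i v   = _mm_loadu_si128((const __m128i*)(p + i));
        __m128i hit = _mm_cmpeq_epi32(v, vold);                  // -1 where v == oldVal
        __m128i res = _mm_or_si128(_mm_and_si128(hit, vnew),     // take newVal where hit
                                   _mm_andnot_si128(hit, v));    // keep v elsewhere
        _mm_storeu_si128((__m128i*)(p + i), res);                // unconditional store
    }
}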
Upvotes: 3
Reputation: 17512
You didn't mention what architecture you're developing for, so it's impossible to tell you which intrinsics to use. Luckily your compiler should be able to auto-vectorize something like
for (int i = 0 ; i < N ; i++)
    foo[i] = (foo[i] == 4) ? 8 : foo[i];
Assuming your data is sufficiently aligned, with -mavx2 -O3 GCC will use vpcmpeqd and vpblendvb.
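Applied to the question's OpenCV Mat, a sketch of that pattern could look like this (assuming labels is a single-channel CV_32S Mat; Mat::ptr<int>() gives a raw row pointer, which helps the compiler vectorize the inner loop):

for (int i = 0; i < labels.rows; ++i) {
    int* row = labels.ptr<int>(i);   // raw pointer to this row's int data
    for (int j = 0; j < labels.cols; ++j)
        row[j] = (row[j] == oldValue) ? newValue : row[j];
}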
Upvotes: 5