Reputation: 1774
Does anybody know to to vectorize something like this using SIMD :
for(size_t i = 0; i < refSeq.length() / 4; i++){
for(size_t j = 0; j<otherSeq.length(); j++){
if(refSeq[i] == otherSeq[j]){
if(i == 0 || j == 0)
L[i][j] = 1;
else
L[i][j] = L[i-1][j-1] + 1;
}
else
L[i][j] = 0;
}
}
Upvotes: 0
Views: 932
Reputation: 1327
For some reason solution by @arunmoezhi didn't autovectorize. https://godbolt.org/z/KM4fh8TY7
.L8:
mov edx, DWORD PTR [rax+4+r8]
cmp DWORD PTR [rcx], edx
je .L5
mov DWORD PTR [rsi+4+rax], 0
add rax, 4
cmp rdi, rax
jne .L8
Here is an implementation with eve library though: https://godbolt.org/z/j8Kva8caP
The main loop looks like this (avx2) only unrolled:
.LBB0_5: # =>This Inner Loop Header: Depth=1
vmovdqu ymm2, ymmword ptr [rax + rcx]
vpcmpeqd ymm3, ymm0, ymmword ptr [rsi + rcx]
vpsubd ymm2, ymm2, ymm1
vpand ymm2, ymm3, ymm2
vmovdqu ymmword ptr [rdi + rcx], ymm2
lea rbx, [rsi + rcx]
add rbx, 32
add rcx, 32
cmp rbx, r11
jne .LBB0_5
Upvotes: 2
Reputation: 2017
This is a dynamic programming problem and a strait forward implementation has too much data dependency to be suitable for SIMD calculations.
However, if you change the algorithm from iterating row-wise to iterating diagonal-wise the whole diagonal can be calculated in parallel. See image below.
The "pseudo" code below uses a matrix with 1 extra row/column in order to simplify the "inner" calculation. This extra row/column is initialized before every diagonal-iteration.
int i, j, k;
for (k = 1; ; k++) {
int minI = k > refLen ? k - refLen : 1;
int maxI = k > otherLen ? otherLen : k - 1;
for (i = maxI; i >= minI; ) {
j = k - i;
// vectorized calculation 256 bit (AVX2)
if (i >= 32 && otherLen - j >= 32) {
// calculate 32 values of the diagonal with SIMD
i -= 32;
continue;
}
// vectorized calculation 128 bit (SSE)
if (i >= 16 && otherLen - j >= 16) {
// calculate 16 values of the diagonal with SIMD
i -= 16;
continue;
}
// scalar calculation
if (refSeq[i - 1] == otherSeq[j - 1]) {
L[i][j] = L[i - 1][j - 1] + 1;
} else {
L[i][j] = 0;
}
i--;
}
if (k == otherLen + refLen) {
break;
}
// initialize next j-endpoint in diagonal
if (k <= refLen) {
L[0][k] = 0;
}
// initialize next i-endpoint in diagonal
if (k <= otherLen) {
L[k][0] = 0;
}
}
Not sure if you wanted the actual SIMD instructions for the calculation, or just know how to parallelize/vectorize the calculation.
Upvotes: 1
Reputation: 3200
let me try to propose a solution. Initially compute the values of L[i][0] and L[0][j]. Now start iterating from i=1 and j=1. Now the check for i==0 or j==0 in each iteration of the loop can be removed. And one more advantage of this is that for every L[i][j] in each iteration on a row, the value of L[i-1][j-1] is available. Now, say if the vector registers can hold 4 elements of the array. Now we can load 4 elements of refSeq, otherSeq, L(previous row) and L(current row). Theoretically we get vectorization now. I assume auto vectorizer will not recognize this. So we have to manually do it. Please correct me if I'm wrong.
for(size_t i=0;i<refSeq.length()/4;i++)
{
if(refSeq[i]==otherSeq[0])
L[i][0]=1;
else
L[i][0]=0;
}
for(size_t j=0; j<otherSeq.length();j++)
{
if(refSeq[0]==otherSeq[j])
L[0][j]=1;
else
L[0][j]=0;
}
for(size_t i=1;i<refSeq.length()/4;i++)
{
for(size_t j=1; j<otherSeq.length();j++)
{
if(refSeq[i]==otherSeq[j])
L[i][j] = L[i-1][j-1] + 1;
else
L[i][j]=0;
}
}
One disadvantage would be that now we are loading the previous row no matter if refSeq[i] is equal to otherSeq[j] or not as with the original code where the diagonal element is accessed only if the sequences are equal.
Upvotes: 1