vanste25
vanste25

Reputation: 1774

Vectorizing Nested Loop - SIMD

Does anybody know to to vectorize something like this using SIMD :

for(size_t i = 0; i < refSeq.length() / 4; i++){

    for(size_t j = 0; j<otherSeq.length(); j++){
    if(refSeq[i] == otherSeq[j]){
        if(i == 0 || j == 0)
            L[i][j] = 1;
       else
        L[i][j] = L[i-1][j-1] + 1;
    }
       else
        L[i][j] = 0;
    }
}

Upvotes: 0

Views: 932

Answers (3)

Denis Yaroshevskiy
Denis Yaroshevskiy

Reputation: 1327

For some reason solution by @arunmoezhi didn't autovectorize. https://godbolt.org/z/KM4fh8TY7

.L8:
        mov     edx, DWORD PTR [rax+4+r8]
        cmp     DWORD PTR [rcx], edx
        je      .L5
        mov     DWORD PTR [rsi+4+rax], 0
        add     rax, 4
        cmp     rdi, rax
        jne     .L8

Here is an implementation with eve library though: https://godbolt.org/z/j8Kva8caP

The main loop looks like this (avx2) only unrolled:

.LBB0_5:                                # =>This Inner Loop Header: Depth=1
        vmovdqu ymm2, ymmword ptr [rax + rcx]
        vpcmpeqd        ymm3, ymm0, ymmword ptr [rsi + rcx]
        vpsubd  ymm2, ymm2, ymm1
        vpand   ymm2, ymm3, ymm2
        vmovdqu ymmword ptr [rdi + rcx], ymm2
        lea     rbx, [rsi + rcx]
        add     rbx, 32
        add     rcx, 32
        cmp     rbx, r11
        jne     .LBB0_5

Upvotes: 2

gustf
gustf

Reputation: 2017

This is a dynamic programming problem and a strait forward implementation has too much data dependency to be suitable for SIMD calculations.

However, if you change the algorithm from iterating row-wise to iterating diagonal-wise the whole diagonal can be calculated in parallel. See image below.

enter image description here

The "pseudo" code below uses a matrix with 1 extra row/column in order to simplify the "inner" calculation. This extra row/column is initialized before every diagonal-iteration.

int i, j, k;
for (k = 1; ; k++) {
    int minI = k > refLen ? k - refLen : 1;
    int maxI = k > otherLen ? otherLen : k - 1;

    for (i = maxI; i >= minI; ) {
        j = k - i;

        // vectorized calculation 256 bit (AVX2)
        if (i >= 32 && otherLen - j >= 32) {
            // calculate 32 values of the diagonal with SIMD
            i -= 32;
            continue;
        }

        // vectorized calculation 128 bit (SSE)
        if (i >= 16 && otherLen - j >= 16) {
            // calculate 16 values of the diagonal with SIMD
            i -= 16;
            continue;
        }

        // scalar calculation
        if (refSeq[i - 1] == otherSeq[j - 1]) {
            L[i][j] = L[i - 1][j - 1] + 1;
        } else {
            L[i][j] = 0;
        }
        i--;
    }

    if (k == otherLen + refLen) {
        break;
    }

    // initialize next j-endpoint in diagonal
    if (k <= refLen) {
        L[0][k] = 0;
    }
    // initialize next i-endpoint in diagonal
    if (k <= otherLen) {
        L[k][0] = 0;
    }
}

Not sure if you wanted the actual SIMD instructions for the calculation, or just know how to parallelize/vectorize the calculation.

Upvotes: 1

arunmoezhi
arunmoezhi

Reputation: 3200

let me try to propose a solution. Initially compute the values of L[i][0] and L[0][j]. Now start iterating from i=1 and j=1. Now the check for i==0 or j==0 in each iteration of the loop can be removed. And one more advantage of this is that for every L[i][j] in each iteration on a row, the value of L[i-1][j-1] is available. Now, say if the vector registers can hold 4 elements of the array. Now we can load 4 elements of refSeq, otherSeq, L(previous row) and L(current row). Theoretically we get vectorization now. I assume auto vectorizer will not recognize this. So we have to manually do it. Please correct me if I'm wrong.

for(size_t i=0;i<refSeq.length()/4;i++)
{
    if(refSeq[i]==otherSeq[0])
        L[i][0]=1;
    else
        L[i][0]=0;
}
for(size_t j=0; j<otherSeq.length();j++)
{
    if(refSeq[0]==otherSeq[j])
        L[0][j]=1;
    else
        L[0][j]=0;
}

for(size_t i=1;i<refSeq.length()/4;i++)
{
    for(size_t j=1; j<otherSeq.length();j++)
    {
        if(refSeq[i]==otherSeq[j])
            L[i][j] = L[i-1][j-1] + 1;
        else
            L[i][j]=0;
    }
}

One disadvantage would be that now we are loading the previous row no matter if refSeq[i] is equal to otherSeq[j] or not as with the original code where the diagonal element is accessed only if the sequences are equal.

Upvotes: 1

Related Questions