Is there an error in Gusfield's description of the dynamic programming algorithm for finding global alignments with constant gap penalty?

Question

Gusfield (Algorithms on Strings, Trees, and Sequences, Section 11.8.6) describes a dynamic programming algorithm for finding the best alignment between two sequences A and B under the assumption that the penalty assigned to a gap of length x in one of the aligned sequences is of the form R+Sx for constants R and S. In the special case where S=0, there is a penalty for beginning a gap but no penalty for lengthening it. It seems to me that there is an error in Gusfield's formulae and I hope that someone can help me understand the true state of affairs.

Gusfield defines four functions V(i,j), G(i,j), E(i,j) and F(i,j) with the following properties:

V(i,j) is the best score possible for alignments of the prefixes A[1..i] and B[1..j].
E(i,j) is the best possible score for alignments of these prefixes under the condition that B[j] is paired against a gap in A.
F(i,j) is the best possible score for alignments of these prefixes under the condition that A[i] is paired against a gap in B.
G(i,j) is the best possible score for alignments that pair A[i] with B[j].

The recurrences for i and j greater than or equal to 1 are:

V(i,j)=max[E(i,j),F(i,j),G(i,j)]
G(i,j)=V(i,j)+w(A[i],B[j]) where w is the score for a match/mismatch.
E(i,j)=max(E(i,j-1),V(i,j-1)-R]
F(i,j)=max(F(i-1,j),V(i-1,j)-R]

Finally the boundary conditions are given as:

V(i,0)=E(i,0)=-R
V(0,j)=F(0,j)=-R

With all that as preliminaries, consider the situation where we have two sequences each of length one, so that A=p and B=q. There are only three possible alignments in this situation:

p    p-    -p
q    -q    q-

These have scores respectively w(p,q), -2R, -2R. In particular we should have E(0,1)=F(1,0)=-2R. However, the recurrences give that E(0,1) and F(1,0) are both bigger than or equal to -R.

This error in the boundary conditions has consequences. Suppose for example that A=pppppp...p and B=qqqq...q with p and q different. The alignment that sets A completely off from B:

A-
-B

should score as -2R (it has two gaps) and so this alignment is eventually optimal assuming w(p,q)<0.

Gusfield's algorithm does not seem to handle this case correctly.

In practical situations this probably doesn't matter, because typical strings have a lot of matches and so the boundary cases don't contribute to the solution.

Comments/criticisms welcome.

Gianluca Della Vedova · Accepted Answer

Yes, two of the equations are actually incorrect. They should be

F(i,j)=max(F(i,j-1),V(i,j-1)-R] E(i,j)=max(E(i-1,j),V(i-1,j)-R]

Since E(i,j) is the best possible score for alignments of these prefixes under the condition that A[i] is paired against a gap in B, the alignment is made of the optimal alignment of A[1:i-l] against B[1:j] and an l-long gap (the indels are against A[i-l+1:i]).

Is there an error in Gusfield's description of the dynamic programming algorithm for finding global alignments with constant gap penalty?

Answers (1)

Related Questions

Is there an error in Gusfield&#39;s description of the dynamic programming algorithm for finding global alignments with constant gap penalty?

Answers (1)

Related Questions

Is there an error in Gusfield's description of the dynamic programming algorithm for finding global alignments with constant gap penalty?