Jeremy Teitelbaum
Jeremy Teitelbaum

Reputation: 196

Is there an error in Gusfield's description of the dynamic programming algorithm for finding global alignments with constant gap penalty?

Gusfield (Algorithms on Strings, Trees, and Sequences, Section 11.8.6) describes a dynamic programming algorithm for finding the best alignment between two sequences A and B under the assumption that the penalty assigned to a gap of length x in one of the aligned sequences is of the form R+Sx for constants R and S. In the special case where S=0, there is a penalty for beginning a gap but no penalty for lengthening it. It seems to me that there is an error in Gusfield's formulae and I hope that someone can help me understand the true state of affairs.

Gusfield defines four functions V(i,j), G(i,j), E(i,j) and F(i,j) with the following properties:

The recurrences for i and j greater than or equal to 1 are:

V(i,j)=max[E(i,j),F(i,j),G(i,j)]
G(i,j)=V(i,j)+w(A[i],B[j]) where w is the score for a match/mismatch.
E(i,j)=max(E(i,j-1),V(i,j-1)-R]
F(i,j)=max(F(i-1,j),V(i-1,j)-R]

Finally the boundary conditions are given as:

V(i,0)=E(i,0)=-R
V(0,j)=F(0,j)=-R

With all that as preliminaries, consider the situation where we have two sequences each of length one, so that A=p and B=q. There are only three possible alignments in this situation:

p    p-    -p
q    -q    q-

These have scores respectively w(p,q), -2R, -2R. In particular we should have E(0,1)=F(1,0)=-2R. However, the recurrences give that E(0,1) and F(1,0) are both bigger than or equal to -R.

This error in the boundary conditions has consequences. Suppose for example that A=pppppp...p and B=qqqq...q with p and q different. The alignment that sets A completely off from B:

A-
-B

should score as -2R (it has two gaps) and so this alignment is eventually optimal assuming w(p,q)<0.

Gusfield's algorithm does not seem to handle this case correctly.

In practical situations this probably doesn't matter, because typical strings have a lot of matches and so the boundary cases don't contribute to the solution.

Comments/criticisms welcome.

Upvotes: 4

Views: 242

Answers (1)

Gianluca Della Vedova
Gianluca Della Vedova

Reputation: 359

Yes, two of the equations are actually incorrect. They should be

F(i,j)=max(F(i,j-1),V(i,j-1)-R] E(i,j)=max(E(i-1,j),V(i-1,j)-R]

Since E(i,j) is the best possible score for alignments of these prefixes under the condition that A[i] is paired against a gap in B, the alignment is made of the optimal alignment of A[1:i-l] against B[1:j] and an l-long gap (the indels are against A[i-l+1:i]).

Upvotes: 2

Related Questions