Reputation: 2514
I implemented both Myers versions of the diff algorithm, first the greedy one (which works, but is highly ressource intensive when it comes to memory consumption - but Myers writes that in his paper already) and then the refined version.
However, that fails on simple inputs like lcs(»xx«, »x«).
The reason from what I understand is, that the common snake it tries to find, is ambiguous: Let's mark the snakes with »L[...]« for »left snake« and »R[...]« for »right snake« to distinguish them. When going forward in step d = 0, the algorithm detects this snakes: »L[x]x«->»L[x]«. Now in the reverse step, the algorithm detects this snakes: »xR[x]«->»R[x]«. In the forward d = 1 step, the algorithm finds the seconds (empty) snake yielding: »L[x]xR[]«->L[x]R[]« and in the reserse step it finds »L[]xR[x]«->L[]R[x]«.
As you can see, the x-matching in both directions find the first possible snake, but in the forward case, it's L[x] but in the reverse case its R[x].
So this algorithm doesn't find a valid common snake.
In Lemma 3, Myers proves, that IF there is a D-path from (0,0) to (n,m) then there MUST be a D/2-path from (0,0) to some (x,y) and a D/2-path from (x,y) to (n,m). I'll skip the prove, because it's rather obvious, that you can just cut the D-path at any »valid« point to get two shorter paths, that combine back to the one D-path and that you can just take the middle of the D-path to do the cutting part. Anyways: He does prove the existence of such an (x,y) pair (actually, he proves this a little bit more sophisticated by adding a second point (u,v) - but for understanding of my question, that doesn't really matter), but (from my understanding) he does NOT prove, that using his algorithm, you're guaranteed to actually find any of this possible cut-points (x,y) and (u,v) for the D-path.
As shown above, I believe, Myers' refined algorithm actually can't find the solution for »xx«->»x«.
Am I missing something important here?
Upvotes: 1
Views: 267
Reputation: 155024
Am I missing something important here?
Yeah - Myers defines "overlap" in such a way that his algorithm doesn't actually require what you call a "common snake" at all.
Recall that this is the algorithm:
∆ ← N−M
For D ← 0 to ⌈(M+N)/2⌉ Do
For k ← −D to D in steps of 2 Do
Find the end of the furthest reaching forward D-path in diagonal k.
If ∆ is odd and k ∈ [∆−(D−1),∆+(D−1)] Then
If the path overlaps the furthest reaching reverse (D − 1)-path in diagonal k Then
Length of an SES is 2D−1.
The last snake of the forward path is the middle snake.
For k ← −D to D in steps of 2 Do
Find the end of the furthest reaching reverse D-path in diagonal k+∆.
If ∆ is even and k + ∆ ∈ [−D,D] Then
If the path overlaps the furthest reaching forward D-path in diagonal k+∆ Then
Length of an SES is 2D.
The last snake of the reverse path is the middle snake.
In your example where the old text is "xx" and the new text is "x", we have a width 3 by height 2 edit graph, N = 2, M = 1, ∆ = 1 (which is odd).
We start the outermost loop of the algorithm. When D=0, as you note, the forward-reaching path moves diagonally from (0, 0) to (1, 1) and then the reverse path moves diagonally from (2, 1) to (1, 0).
Next we consider D=1. As the algorithm directs us, first we consider extensions to forward-reaching paths, onto diagonals -1 and 1. When we consider diagonal 1 we find we can move horizontally onto it with a move from (1, 1) to (2, 1). As you note, we've now reached the end of the edit graph and there was no "common" snake on this diagonal; the last snake of the furthest-reaching reverse path on this diagonal goes from (2, 1) to (1, 0) while the last snake of forward path is an empty one from (2, 1) to (2, 1).
But, the algorithm doesn't require a common snake. Instead we check...
If ∆ is odd and k ∈ [∆−(D−1),∆+(D−1)]
(yep and yep)
Then
If the path overlaps the furthest reaching reverse (D − 1)-path in diagonal k
Length of an SES is 2D−1.
The last snake of the forward path is the middle snake.
What does "overlaps" mean here? It was defined earlier in lemma 3; a forward path ending at (x, y) is said to "overlap" a reverse path ending at (u, v) if:
x−y = u−v and x ≥ u.
Our forward and reverse path endpoints on diagonal 1 - respectively (2, 1) and (1, 0) - do in fact meet this definition, even though the snakes merely touch at a vertex and don't "overlap" in the (possibly more intuitive) sense of having some edges in common. Indeed, they could "overlap" per Myers's definition sense without touching at all, as would be the case if you instead diffed "x" against "xxx".
Hence we now have in fact found a (empty) "middle snake" from (2, 1) to (2, 1), and the middle snake algorithm has succeeded.
Upvotes: 0