Reputation:
I have 2 files with contents spanning multiple lines. I'd like to find the edit distance; i.e. how many changes are required to transform A to B assuming only insertions and deletions are possible.
> cat > A
A
B
C
D
E
> cat > B
A
B
D
D
F
E
> diff -u A B
--- A 2015-05-12 16:09:31.000000000 +0200
+++ B 2015-05-12 16:09:42.000000000 +0200
@@ -1,5 +1,6 @@
A
B
-C
D
+D
+F
E
Would it be accurate to say that the total number of +
and -
give me the edit distance?
Upvotes: 2
Views: 149
Reputation: 5268
Going by your definition of edit distance (similiar to "Longest common subsequence problem"), you will first need to define what a single change is.
The longest common subsequence problem is a classic computer science problem, the basis of data comparison programs such as the
diff
utility, and has applications in bioinformatics. It is also widely used by revision control systems such as Git for reconciling multiple changes made to a revision-controlled collection of files.
Assuming you want lines to define a change (based on your example), then yes, the total number of +
and -
using the diff
command would suffice. This is because an update/substitution will show up as both a deletion (-
) and an insertion (+
).
See also http://en.wikipedia.org/wiki/Diff_utility#Unified_format
Upvotes: 2