user277465
user277465

Reputation:

Is it possible to infer the edit distance using a unified diff?

I have 2 files with contents spanning multiple lines. I'd like to find the edit distance; i.e. how many changes are required to transform A to B assuming only insertions and deletions are possible.

> cat > A                                 
A
B
C
D
E
> cat > B                                 
A
B
D
D
F
E
> diff -u A B                             
--- A   2015-05-12 16:09:31.000000000 +0200
+++ B   2015-05-12 16:09:42.000000000 +0200
@@ -1,5 +1,6 @@
 A
 B
-C
 D
+D
+F
 E

Would it be accurate to say that the total number of + and - give me the edit distance?

Upvotes: 2

Views: 149

Answers (1)

ohaal
ohaal

Reputation: 5268

Going by your definition of edit distance (similiar to "Longest common subsequence problem"), you will first need to define what a single change is.

  • a single character?
  • a line?
  • a file?

The longest common subsequence problem is a classic computer science problem, the basis of data comparison programs such as the diff utility, and has applications in bioinformatics. It is also widely used by revision control systems such as Git for reconciling multiple changes made to a revision-controlled collection of files.

Assuming you want lines to define a change (based on your example), then yes, the total number of + and - using the diff command would suffice. This is because an update/substitution will show up as both a deletion (-) and an insertion (+).

See also http://en.wikipedia.org/wiki/Diff_utility#Unified_format

Upvotes: 2

Related Questions