Reputation: 1839
I am trying to figure out if there are existing algorithms that can detect changes between two files in terms of additions but also reorders. I have an example below:
processes = 1
a = 0
allactive = []
processes = 2
a = 0
allrecords = range(10)
allactive = []
a = 0
allrecords = range(10)
allactive = []
processes = 2
I need to be able to say that for example user1 code is the three initial lines of code, user 2 added the "allrecords = range(10)" part (as well as a number change), and user 3 did not change anything since he/she just reordered the code.
Ideally, at commit 3, I want to be able to look at the code and say that from character 0 to 20 (this is user1's code), 21-25 user2's code, 26-30 user1's code etc.
I know there are two popular algorithms, Longest common subsequence and longest common substring but I am not sure which one can correctly count additions of new code but be able also to identify reorders.
Of course this still leaves out the question of having the same substring existing twice in a text. Are there any other algorithms that are better suited to this problem?
Upvotes: 4
Views: 280
Reputation: 95402
Each "diff" algorithm defines a set of possible code-change edit types, and then (typically) tries to find the smallest set of such changes that explains how the new file resulted from the old. Usually such algorithms are defined purely syntactically; semantics are not taken into account.
So what you want, based on your example, is an algorithm that allow "change line", "insert line", "move line" (and presumably "delete line" [not in your example but necessary for a practical set of edits]). Given this you ought to be able to define a dynamic programming algorithm to find a smallest set of edits to explain how one file differs from another. Note that this set is defined in terms of edits to whole-lines, rather like classical "diff"; of course classical diff does not have "change line" or "move line" which is why you are looking for something else.
You could pick different types of deltas. Your example explicitly noted "number change"; if narrowly interpreted, this is NOT an edit on lines, but rather within lines. Once you start to allow partial line edits, you need to define how much of a partial line edit is allowed ("unit of change"). (Will your edit set allow "change of digit"?)
Our Smart Differencer family of tools defines the set of edits over well-defined sub-phrases of the targeted language; we use formal language grammar (non)terminals as the unit of change. [This makes each member of the family specific to the grammar of some language] Deltas include programmer-centric concepts such as "replace phrase by phrase", "delete listmember", "move listmember", "copy listmember", "rename identifier"; the algorithm operates by computing a minimal tree difference in terms of these operations. To do this, the SmartDifferencer needs (and has) a full parser (producing ASTs) for the language.
You didn't identify the language for your example. But in general, for a language looking like that, the SmartDifferencer would typically report that User2 commit changes were:
and that User3 commit changes were:
If you know who contributed the original code, with the edits you can straightforwardly determine who contributed which part of the final answer. You have to decide the unit-of-reporting; e.g., if you want report such contributions on a line by line basis for easy readability, or if you really want to track that Mary wrote the code, but Joe modified the number.
To detect that User3's change is semantically null can't be done with purely syntax-driven diff tool of any kind. To do this, the tool has to be able to compute the syntactic deltas somehow, and then compute the side effects of all statements (well, "phrases"), requiring a full static analyzer of the language to interpret the deltas to see if they have such null effects. Such a static analyzer requires a parser anyway so it makes sense to do a tree based differencer, but it also requires a lot more than just parser [We have such language front ends and have considered building such tools, but haven't gotten there yet].
Bottom line: there is no simple algorithm for determining "that user3 did not change anything". There is reasonable hope that such tools can be built.
Upvotes: 2