Reputation: 26169
I need to write a function which will compare 2-5 "files" (well really 2-5 sets of database rows, but similar concept), and I have no clue of how to do it. The resulting diff should present the 2-5 files side by side. The output should show added, removed, changed and unchanged rows, with a column for each file.
What algorithm should I use to traverse rows so as to keep complexity low? The number of rows per file is less than 10,000. I probably won't need External Merge as total data size is in the megabyte range. Simple and readable code would of course also be nice, but it's not a must.
Edit: the files may be derived from some unknown source, there is no "original" to which the other 1-4 files can be compared to; all files will have to be compared to the others in their own right somehow.
Edit 2: I, or rather my colleague, realized that the contents may be sorted, as the output order is irrelevant. This solution means using additional domain knowledge to this part of the application, but also that diff complexity is O(N) and less complicated code. This solution is simple and I'll disregards any answers to this edit when I close the bounty. However I'll answer my own question for future reference.
Upvotes: 3
Views: 246
Reputation: 15769
The case of 2 files can be solved with a standard diff algorithm. From 3 files on you can use a "majority vote" algorithm:
If more than half of the records are the same: 2 out of 3, 3 out of 4, 3 out 5 than these are the reference to consider the other record(s) changed.
Also this means quite a speedup for the algorithm if the number of changes is comparatively low.
Pseudocode:
initialize as many line indexes as there are files
while there are still at least 3 indexes incrementable
if all indexed records are the same
increment all line indexes
else
if at least one is different - check majority vote
if there is a majority
mark minority changes, increment all line indexes
else
mark minority additions (maybe randomly deciding e.g. in a 2:2 vote)
check addition or removing and set line indexes accordingly
increment all indexes
endif
endif
endwhile
Upvotes: 0
Reputation: 26169
Pseudo code (for Edit 2):
10: stored cells = <empty list>
for each column:
if cell < stored cells:
stored cells = cell
elif cell == lastCell:
stored cells += cell
if stored cells == <empty>:
return result
result += stored cells
goto 10
Upvotes: 1
Reputation: 10841
If all of the n files (where 2 <= n <= 5 for the example) have to be compared to the others, then it seems to me that the number of combinations to compare will be C(n,2), defined by (in Python, for instance) as:
def C(n,k):
return math.factorial(n)/(math.factorial(k)*math.factorial(n-k))
Thus, you would have 1, 3, 6 or 10 pairwise comparisons for n = 2, 3, 4, 5 respectively.
The time complexity would then be C(n,2) times the complexity of the pairwise diff algorithm that you chose to use, which would be an expected O(ND), in the case of Myers' algorithm, where N is the sum of the lengths of the two sequences to be compared, A and B, and D is the size of the minimum edit script for A and B.
I'm not sure about the environment in which you need this code but difflib in Python, as an example, can be used to find the differences between all sorts of sequences - not just text lines - so it might be useful to you. The difflib
documentation doesn't say exactly what algorithm it uses, but its discussion of its time complexity makes me think that it is similar to Myers'.
Upvotes: 2