Jonas Byström

Reputation: 26169

Comparing five different sources

I need to write a function which will compare 2-5 "files" (well, really 2-5 sets of database rows, but it's a similar concept), and I have no clue how to do it. The resulting diff should present the 2-5 files side by side. The output should show added, removed, changed and unchanged rows, with a column for each file.

What algorithm should I use to traverse the rows so as to keep complexity low? The number of rows per file is less than 10,000. I probably won't need an external merge, as the total data size is in the megabyte range. Simple and readable code would of course also be nice, but it's not a must.

Edit: the files may be derived from some unknown source; there is no "original" to which the other 1-4 files can be compared. All files will have to be compared to the others in their own right somehow.

Edit 2: I, or rather my colleague, realized that the contents may be sorted, as the output order is irrelevant. This approach means applying additional domain knowledge in this part of the application, but also that diff complexity drops to O(N) and the code becomes less complicated. Since this solution is simple, I'll disregard any answers addressing this edit when I close the bounty. However, I'll answer my own question for future reference.

Upvotes: 3

Views: 246

Answers (3)

Wolfgang Fahl

Reputation: 15769

The case of 2 files can be solved with a standard diff algorithm. From 3 files onward, you can use a "majority vote" algorithm:

If more than half of the records are the same (2 out of 3, 3 out of 4, 3 out of 5), then they form the reference against which the other record(s) are considered changed.

This also means quite a speedup for the algorithm when the number of changes is comparatively low.

Pseudocode:
  initialize as many line indexes as there are files
  while there are still at least 3 indexes incrementable
    if all indexed records are the same 
      increment all line indexes
    else 
      if at least one is different - check majority vote
      if there is a majority
        mark minority changes, increment all line indexes
      else
        mark minority additions (maybe randomly deciding e.g. in a 2:2 vote)
        check for addition or removal and set line indexes accordingly
        increment all indexes
      endif
    endif
  endwhile
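
For reference, here is a minimal Python sketch of the majority-vote idea. It assumes the files are already row-aligned (row i in every file describes the same logical record) and that rows are hashable (e.g. tuples), so it side-steps the addition/removal bookkeeping from the pseudocode above; the function and variable names are mine, not part of the answer.

from collections import Counter

def majority_diff(files):
    # files: a list of equally long, row-aligned lists of hashable rows.
    # Returns one status list per file, with 'same' or 'changed' per row.
    statuses = [[] for _ in files]
    for rows in zip(*files):                    # the i-th row of every file
        reference, votes = Counter(rows).most_common(1)[0]
        if votes * 2 <= len(rows):              # no strict majority (e.g. a 2:2 vote)
            reference = None                    # mark every row as changed instead
        for status, row in zip(statuses, rows):
            status.append('same' if row == reference else 'changed')
    return statuses

# Three row-aligned "files": the middle row of b differs from the majority.
a = [('x',), ('y',), ('z',)]
b = [('x',), ('q',), ('z',)]
c = [('x',), ('y',), ('z',)]
print(majority_diff([a, b, c]))
# [['same', 'same', 'same'], ['same', 'changed', 'same'], ['same', 'same', 'same']]

In the tied case, this sketch simply marks every row as changed rather than deciding randomly, as the pseudocode suggests.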

Upvotes: 0

Jonas Byström

Reputation: 26169

Pseudocode (for Edit 2):

10: stored cells = <empty list>
for each column:
  cell = current cell of this column
  if stored cells is empty or cell < stored cells[0]:
    stored cells = [cell]
  elif cell == stored cells[0]:
    stored cells += cell

if stored cells == <empty list>:
  return result
result += stored cells
advance the row index of each column whose cell matched stored cells
goto 10
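
The same idea as a minimal Python sketch, assuming each "file" is a sorted, duplicate-free list of comparable rows; the names are mine, not part of the pseudocode.

def sorted_diff(files):
    # k-way merge over sorted, duplicate-free row lists.
    # Yields (row, presence) where presence[i] tells whether files[i] has the row.
    # Each row of each file is visited once, so the work is O(N) overall.
    indexes = [0] * len(files)
    while True:
        current = [f[i] for f, i in zip(files, indexes) if i < len(f)]
        if not current:
            return
        smallest = min(current)                  # the "stored cells" of the pseudocode
        presence = []
        for j, (f, i) in enumerate(zip(files, indexes)):
            has_row = i < len(f) and f[i] == smallest
            presence.append(has_row)
            if has_row:
                indexes[j] += 1                  # advance only the columns that matched
        yield smallest, presence

# Three sorted "files"; 2 is missing from b and 3 is missing from a.
a = [1, 2, 4]
b = [1, 3, 4]
c = [1, 2, 3, 4]
for row, presence in sorted_diff([a, b, c]):
    print(row, presence)
# 1 [True, True, True]
# 2 [True, False, True]
# 3 [False, True, True]
# 4 [True, True, True]

A row present in every file is unchanged; a row missing from some files shows up as added or removed relative to them.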

Upvotes: 1

Simon

Reputation: 10841

If all of the n files (where 2 <= n <= 5 for the example) have to be compared to each other, then it seems to me that the number of pairs to compare will be C(n,2), defined (in Python, for instance) as:

import math

def C(n, k):
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

Thus, you would have 1, 3, 6 or 10 pairwise comparisons for n = 2, 3, 4, 5 respectively.
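
For illustration, itertools.combinations enumerates exactly those pairs (the file names below are placeholders):

from itertools import combinations

files = ['file_a', 'file_b', 'file_c', 'file_d']    # placeholder names
for left, right in combinations(files, 2):           # C(4, 2) == 6 pairs
    print(left, right)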

The time complexity would then be C(n,2) times the complexity of the pairwise diff algorithm you choose, which is O(ND) in the case of Myers' algorithm, where N is the sum of the lengths of the two sequences A and B being compared and D is the size of the minimum edit script for A and B.

I'm not sure about the environment in which you need this code, but difflib in Python, as an example, can be used to find the differences between all sorts of sequences - not just text lines - so it might be useful to you. The difflib documentation doesn't say exactly what algorithm it uses, but its discussion of time complexity makes me think it is similar to Myers'.
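
As a small illustration of that, difflib.SequenceMatcher accepts any sequences of hashable items, so database rows represented as tuples work directly; the rows below are invented for the example:

import difflib

rows_a = [(1, 'alice'), (2, 'bob'), (3, 'carol')]
rows_b = [(1, 'alice'), (2, 'bobby'), (3, 'carol')]

matcher = difflib.SequenceMatcher(a=rows_a, b=rows_b, autojunk=False)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    # tag is one of 'equal', 'replace', 'delete' or 'insert'
    print(tag, rows_a[i1:i2], rows_b[j1:j2])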

Upvotes: 2
