How to find differences between 2 almost-identical files very fast?

Question

If you have two mainly identical files with 1000s of records, how will you write code to find differences between them. Assume that unix/linux commands are not allowed to be used.

My idea:

Because most of entries are the same, we can sort the entries of the two files, then compare each entry one by one, e.g. entry i in file1 compare to entry i in file2. It is O(n lg n), n is size of file.

Are there O(n) solution ?

user684934 · Accepted Answer

Hash tables are your friends.

Get a record from file 1.
Hash it.
Get the corresponding memory address.
Set it to 1.
Repeat for all records in file 1.
Repeat for all records in file 2, but add 2 instead of setting to 1.

Now you know which record exists in both files (value 3), which exists in the first file only (value 1), and which exists in the second file only (value 2). And in linear time.

Note: If you're implementing your own hash table, you have to handle growing the size of your table as needed, as well as collisions. I'm sure if you could do that then you wouldn't have a hard time with this question, so use a library.

How to find differences between 2 almost-identical files very fast?

Answers (1)

Related Questions