Daniel W.
Daniel W.

Reputation: 623

Check for differences between two (large) files

I want to write a relatively simple program, that can backup files from my computer to a remote location and encrypt them in the process, while also computing a diff (well not really...I'm content with seeing if anything changed at all, not so much what has changed) between the local and the remote files to see which ones have changed and are necessary to update.

I am aware that there are perfectly good programs out there to do this (rsync, or others based on duplicity). I'm not trying to reinvent the wheel, it's just supposed to be a learning experience for myself

My question is regarding to the diff part of the project. I have made some assumptions and wrote some sample code to test them out, but I would like to know if you see anything I might have missed, if the assumptions are just plain wrong, or if there's something that could go wrong in a particular constelation.

Assumption 1: If files are not of equal length, they can not be the same (ie. some modification must have taken place)
Assumption 2: If two files are the same (ie. no modification has taken place) any byte sub-set of these two files will have the same hash
Assumption 3: If a byte sub-set of two files is found which does not result in the same hash, the two files are not the same (ie. have been modified)

The code is written in Java and the hashing algorithm used is BLAKE-512 using the java implementation from Marc Greim.
_File1 and _File2 are 2 files > 1.5GB of type java.io.File

public boolean compareStream() throws IOException {
    int i = 0;
    int step = 4096;
    boolean equal = false;

    FileInputStream fi1 = new FileInputStream(_File1);      
    FileInputStream fi2 = new FileInputStream(_File2);

    byte[] fi1Content = new byte[step];
    byte[] fi2Content = new byte[step];

    if(_File1.length() == _File2.length()) { //Assumption 1
        while(i*step < _File1.length()) {   

            fi1.read(fi1Content, 0, step); //Assumption 2
            fi2.read(fi2Content, 0, step); //Assumption 2

            equal = BLAKE512.isEqual(fi1Content, fi2Content); //Assumption 2

            if(!equal) { //Assumption 3
                break;
            }

            ++i;
        }
    }

    fi1.close();
    fi2.close();
    return equal;
}

The calculation for two equal 1.5 GB files takes around 4.2 seconds. Times are of course much shorter when the files differ, especially when they are of different length since it returns immediately.

Thank you for your suggestions :)
..I hope this isn't too broad

Upvotes: 2

Views: 1034

Answers (1)

Alex Salauyou
Alex Salauyou

Reputation: 14338

While assumptions are correct, they won't protect from rare false positives (when method says files are equal when they aren't):

Assumption 2: If two files are the same (ie. no modification has taken place) any byte sub-set will have the same hash

This is right, but because of hash collisions you can have the situation, when hashes of chunks are the same, but chunks themselves differ.

Upvotes: 1

Related Questions