Kelly

Reputation: 213

Finding duplicate files using Python

This question appeared in a Python coding competition, and I was wondering how it can be achieved.

Problem statement:

You have two directories (possibly containing subdirectories). Your script should find duplicate files by comparing the contents of same-named files in the two root directories.

Result: FAIL: if the contents of at least one pair of same-named files differ

PASS: otherwise

Here's a sample layout:

 /dir1                       /dir2
       -- file1                   -- file1 
       -- file2                   -- fileA  
       -- file3                   -- fileB   
       -- ....
       -- ...
       ---/subDir1
            --file1
            --file2

file1 of dir1 contains: foo bar
file1 of dir2 contains: foo
Result - Fail

file1 of dir1 contains: foo bar
file1 of dir2 contains: foo bar
Result - Pass

I attempted hashing by file size, but that was obviously not the way to go :)

PS: Any scripting language can be used.

Thanks, Kelly

Upvotes: 1

Views: 668

Answers (2)

Janne Karila

Reputation: 25197

Have a look at the filecmp module in the standard library.

Computing hashes is not useful when each file is compared to just one other file. The entire file must be read to compute a hash, and read again to confirm a match. By contrast, a direct comparison can be aborted at first difference.
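As a minimal sketch, assuming the two root directories come in as command-line arguments (the compare_trees walk is my own illustration, not part of the original answer):

    import filecmp
    import os
    import sys

    def compare_trees(dir1, dir2):
        """FAIL as soon as a same-named file under dir1 differs from its counterpart under dir2."""
        for root, _dirs, files in os.walk(dir1):
            rel = os.path.relpath(root, dir1)
            for name in files:
                f1 = os.path.join(root, name)
                f2 = os.path.join(dir2, rel, name)
                # Only files that exist under both roots are compared,
                # per the problem statement.
                if os.path.isfile(f2) and not filecmp.cmp(f1, f2, shallow=False):
                    return False
        return True

    if __name__ == "__main__":
        print("PASS" if compare_trees(sys.argv[1], sys.argv[2]) else "FAIL")

Passing shallow=False makes filecmp.cmp compare file contents rather than just os.stat() signatures, and the comparison stops at the first differing block.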

Upvotes: 1

anijhaw

Reputation: 9402

You can solve this in a tiered manner.

  1. Walk each dir/subdir and compare the sizes of the same-named files.
  2. If the file sizes differ => fail.
  3. If the sizes match, compute the SHA1 hash of each file; if those do not match => fail.
  4. If the SHA1 hashes match, do a byte-by-byte comparison of the file contents to be absolutely sure (see the sketch after this list).
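A minimal sketch of those tiers for one pair of paths, assuming Python; the helper names, the 64 KiB chunk size, and the use of hashlib for SHA-1 are illustrative choices, not part of the original answer:

    import hashlib
    import os

    CHUNK = 1 << 16  # 64 KiB read size; an arbitrary but reasonable choice

    def sha1_of(path):
        """Tier 2 helper: stream the file through SHA-1 so large files are not loaded whole."""
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(CHUNK), b""):
                h.update(chunk)
        return h.hexdigest()

    def bytes_equal(path1, path2):
        """Tier 3 helper: byte-by-byte comparison that stops at the first differing chunk."""
        with open(path1, "rb") as f1, open(path2, "rb") as f2:
            while True:
                b1, b2 = f1.read(CHUNK), f2.read(CHUNK)
                if b1 != b2:
                    return False
                if not b1:  # both streams exhausted at the same point
                    return True

    def same_file(path1, path2):
        # Tier 1: cheap size check.
        if os.path.getsize(path1) != os.path.getsize(path2):
            return False
        # Tier 2: SHA-1 digests.
        if sha1_of(path1) != sha1_of(path2):
            return False
        # Tier 3: byte-by-byte confirmation to rule out a hash collision.
        return bytes_equal(path1, path2)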

Upvotes: 3
