P1C Unrelated

Reputation: 221

MD5 checks for repeated files in folder?

Let's say I have a folder with five hundred pictures in it, and I want to check for repeats and delete them.

Here's the code I have right now:

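// Needs: using System.IO; using System.Security.Cryptography;
// The method name is illustrative; the original snippet showed only the body.
static byte[] ComputeMd5(string filename)
{
    using (var md5 = MD5.Create())
    {
        using (var stream = File.OpenRead(filename))
        {
            return md5.ComputeHash(stream);
        }
    }
}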

Would this be viable to spot repeated MD5s in a specific folder, provided I loop it accordingly?

Upvotes: 0

Views: 455

Answers (2)

hagello

Reputation: 3245

Creating hashes in order to identify identical files is OK, in any programming language, on any OS. It is slow, though, because you read every file in full even when a cheaper check, such as comparing file sizes, would already rule out a match.

I would recommend several passes for finding duplicates:

  1. get the size of all files
  2. for all files of equal size: get the hash of the first, say, 1k bytes
  3. for all files of equal size and equal hash of first 1k: get the hash of the entire file
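
The three passes above could be sketched roughly as follows. This is only an illustration, not code from the answer: the class `DuplicateFinder`, its methods, and the 1024-byte prefix size are assumptions made for the example. Each pass splits the surviving groups by a more expensive key and drops any group that has shrunk to a single file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper class; all names here are illustrative.
public static class DuplicateFinder
{
    // MD5 of the first maxBytes bytes (or of the whole file when maxBytes is null),
    // returned as a hex string so that it can serve as a grouping key.
    private static string Hash(string path, int? maxBytes)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            if (maxBytes == null)
            {
                return BitConverter.ToString(md5.ComputeHash(stream));
            }
            var buffer = new byte[maxBytes.Value];
            int total = 0;
            int read = 0;
            // Read may return fewer bytes than requested, so loop until full or EOF.
            while (total < buffer.Length &&
                   (read = stream.Read(buffer, total, buffer.Length - total)) > 0)
            {
                total += read;
            }
            return BitConverter.ToString(md5.ComputeHash(buffer, 0, total));
        }
    }

    // Splits every group by the given key and keeps only groups with more than one file.
    private static IEnumerable<List<string>> Split(
        IEnumerable<List<string>> groups, Func<string, object> key)
    {
        return groups.SelectMany(g => g.GroupBy(key))
                     .Where(g => g.Count() > 1)
                     .Select(g => g.ToList());
    }

    // Returns groups of full paths whose size, prefix hash, and full hash all match.
    public static List<List<string>> FindDuplicates(string directoryPath)
    {
        var all = new List<List<string>> { Directory.EnumerateFiles(directoryPath).ToList() };
        var bySize = Split(all, p => new FileInfo(p).Length);       // pass 1: size
        var byPrefix = Split(bySize, p => Hash(p, 1024));           // pass 2: first 1k
        return Split(byPrefix, p => Hash(p, null)).ToList();        // pass 3: whole file
    }
}
```

Calling `DuplicateFinder.FindDuplicates(folder)` yields one list per set of probable duplicates; which file of each set to keep is left to the caller.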

There is a risk of hash collisions. You cannot avoid it with hash algorithms. As MD5 uses 128 bits, the risk is 1 : (1 << 128), that is 1 in 2^128 or roughly 3 × 10^-39, for two random files. Your chances of getting the jackpot in your national lottery four times in a row, using only one lottery ticket each week, are much better than getting a hash collision on a random pair of files.

The probability of a hash collision does rise somewhat when you compare the hashes of many files. The mathematically interested, and people implementing hash containers, should look up the "birthday problem". Mere mortals trust MD5 hashes when they are not implementing cryptographic algorithms.
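
To put a number on "somewhat", here is a small sketch of the usual birthday-problem approximation: for n random files and a b-bit hash, the chance of at least one collision is about n(n-1)/2 divided by 2^b, as long as that value is far below 1. The class and method names are made up for this example.

```csharp
using System;

// Hypothetical helper; the names are illustrative.
public static class CollisionEstimate
{
    // Birthday-problem approximation: number of file pairs divided by the
    // number of possible hash values. Only meaningful while the result is tiny.
    public static double Approximate(long fileCount, int hashBits)
    {
        double pairs = fileCount * (fileCount - 1) / 2.0;
        return pairs / Math.Pow(2.0, hashBits);
    }
}
```

For the five hundred pictures in the question, `CollisionEstimate.Approximate(500, 128)` comes out at roughly 3.7 × 10^-34: larger than the single-pair figure, but still negligible for files that were not crafted to collide.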

Upvotes: 2

user2991535

Reputation: 119

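using System;
using System.IO;
using System.Collections.Generic;
internal static class FileComparer
{
    // Deletes every file in the folder that is byte-for-byte identical to an earlier file.
    public static void Compare(string directoryPath)
    {
        if(!Directory.Exists(directoryPath))
        {
            return;
        }
        FileComparer.Compare(new DirectoryInfo(directoryPath));
    }
    private static void Compare(DirectoryInfo info)
    {
        List<FileInfo> files = new List<FileInfo>(info.EnumerateFiles());
        // Indexes of files already deleted as duplicates, so they are never read again.
        HashSet<int> deleted = new HashSet<int>();
        for(int i = 0; i < files.Count; i++)
        {
            if(deleted.Contains(i))
            {
                continue;
            }
            byte[] array = File.ReadAllBytes(files[i].FullName);
            // Start at i + 1 so that a file is never compared with itself and deleted.
            for(int j = i + 1; j < files.Count; j++)
            {
                // Compare sizes first; the contents are only read when the sizes match.
                if(deleted.Contains(j) || files[j].Length != array.Length)
                {
                    continue;
                }
                byte[] array2 = File.ReadAllBytes(files[j].FullName);
                bool flag = array2.Length == array.Length;
                for(int current = 0; flag && current < array.Length; current++)
                {
                    if(array[current] != array2[current])
                    {
                        flag = false;
                    }
                }
                if(flag)
                {
                    files[j].Delete();
                    deleted.Add(j);
                }
            }
        }
    }
}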

Upvotes: 1
