Dan Cundy

Reputation: 2849

How to compare files using Byte Array and Hash

Background

I am converting media files to a new format and need a way of knowing whether I have already converted a file during the current run.

My solution

I hash each file and store the hash in a list. Each time I go to convert a file, I hash it and check that hash against the hashes already stored in the list.

Problem

My logic doesn't seem able to detect when I've already seen a file and I end up converting the same file multiple times.

Code

//Byte array of already processed files
 private static readonly List<byte[]> Bytelist = new List<byte[]>();      

        public static bool DoCheck(string file)
        {
            FileInfo info = new FileInfo(file);

            while (FrmMain.IsFileLocked(info)) //Make sure file is finished being copied/moved
            {
                Thread.Sleep(500);
            }

            //Get byte sig of file and if seen before dont process
            byte[] myFileData = File.ReadAllBytes(file);
            byte[] myHash = MD5.Create().ComputeHash(myFileData);

            if (Bytelist.Count != 0)
            {
                foreach (var item in Bytelist)
                {
                    //If seen before ignore
                    if (myHash == item)
                    {
                        return true;
                    }
                }
            }
            Bytelist.Add(myHash);
            return false;
        }

Question

Is there a more efficient way of trying to achieve my end goal? What am I doing wrong?

Upvotes: 0

Views: 1949

Answers (4)

Stefano d'Antonio

Reputation: 6152

There are multiple questions; I'm going to answer the first one:

Is there a more efficient way of trying to achieve my end goal?

TL;DR yes.

You're identifying files by their hashes alone, and computing a hash is a really expensive operation because the whole file has to be read. You can do cheaper checks before calculating the hash:

  1. Is the file size the same as that of an already processed file? If not, the file is new and you can stop here; if so, go to the next check.
  2. Are the first bunch of bytes the same? If not, the file is new; if so, go to the next check.
  3. At this point you have to check the hashes (MD5).

Of course you will have to store size/first X bytes/hash for each processed file.
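A minimal sketch of the size-first idea, under some loud assumptions: the class name SizeFirstCheck, the Entry record and the method names are hypothetical, the "first X bytes" stage is left out for brevity, and hashing of earlier files is deferred, which only works if those files still exist unchanged on disk.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public static class SizeFirstCheck
{
    // One record per processed file; Hash stays null until a same-sized file shows up.
    private sealed class Entry
    {
        public string Path;
        public string Hash;
    }

    private static readonly Dictionary<long, List<Entry>> EntriesBySize =
        new Dictionary<long, List<Entry>>();

    // Returns true if a file with the same size and MD5 hash was registered earlier.
    public static bool SeenBefore(string path)
    {
        long size = new FileInfo(path).Length;

        List<Entry> sameSize;
        if (!EntriesBySize.TryGetValue(size, out sameSize))
        {
            // Unique size so far: the file cannot be a duplicate, so skip hashing.
            EntriesBySize[size] = new List<Entry> { new Entry { Path = path } };
            return false;
        }

        string hash = HashFile(path);
        foreach (Entry entry in sameSize)
        {
            // Assumption: the earlier file still exists unchanged at entry.Path.
            if (entry.Hash == null) entry.Hash = HashFile(entry.Path);
            if (entry.Hash == hash) return true;
        }

        sameSize.Add(new Entry { Path = path, Hash = hash });
        return false;
    }

    private static string HashFile(string path)
    {
        // Hash from a stream so the whole file is never held in memory at once.
        using (MD5 md5 = MD5.Create())
        using (FileStream stream = File.OpenRead(path))
        {
            return Convert.ToBase64String(md5.ComputeHash(stream));
        }
    }
}
```

With this layout a file whose size has not been seen before costs one FileInfo lookup and no hashing at all; the MD5 is only computed once two files share a size.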

In addition, an identical MD5 doesn't guarantee that the files are identical, so you might want an extra step that checks whether they really are the same. That may be overkill, though: it depends on how costly reprocessing a file is, and avoiding expensive hash calculations may matter more.

EDIT: As for the second question: the comparison is likely to fail because you are comparing the references of two byte arrays, which will never be the same since you create a new array every time. You need a sequence-equality comparison between the byte[] values (or convert the hash to a string and compare strings):

var exists = Bytelist.Any(hash => hash.SequenceEqual(myHash));

Upvotes: 3

g.schroeter

Reputation: 24

You have to compare the byte arrays item by item:

foreach (var item in Bytelist)
{
    // If seen before, ignore
    if (myHash.Length == item.Length)
    {
        bool isequal = true;
        for (int i = 0; i < myHash.Length; i++)
        {
            if (myHash[i] != item[i])
            {
                isequal = false;
                break; // no need to compare the remaining bytes
            }
        }
        if (isequal)
        {
            return true;
        }
    }
}

Upvotes: 0

Avner Shahar-Kashtan

Reputation: 14700

There's a lot of room for improvement with regard to efficiency, effectiveness and style, but this isn't CodeReview.SE, so I'll try to stick to the problem at hand:

You're checking whether two byte arrays are equivalent by using the == operator. But that only performs reference equality testing, i.e. it tests whether the two variables point to the same instance, the very same array. That, of course, won't work here.

There are many ways to do it, starting with a simple loop over the arrays (probably with an optimization that checks the lengths first), or using Enumerable.SequenceEqual, as you can find in this answer here.
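To make the difference concrete, here is a small sketch; the class and method names are made up for illustration. It contrasts the == operator on arrays with a content comparison via LINQ's SequenceEqual.

```csharp
using System;
using System.Linq;

public static class ArrayEqualityDemo
{
    // For arrays, == compares references, not contents.
    public static bool SameReference(byte[] a, byte[] b)
    {
        return a == b;
    }

    // SequenceEqual walks both arrays and compares them element by element.
    public static bool SameContent(byte[] a, byte[] b)
    {
        return a.SequenceEqual(b);
    }
}
```

For two separately allocated arrays that both hold { 1, 2, 3 }, SameReference returns false and SameContent returns true. That is exactly the situation in the question, where every call to ComputeHash hands back a fresh array.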

Better yet, convert your hash's byte[] to a string (any string; Convert.ToBase64String would be a good choice) and store that in your Bytelist cache (which should be a HashSet, not a List). Strings are compared by value, which suits this sort of lookup, and you won't run into the "reference equality" problem here.

So a sample solution would be this:

private static readonly HashSet<string> _computedHashes = new HashSet<string>();

public static bool DoCheck(string file)
{
    // ... stuff
    // Get byte sig of file and if seen before don't process
    byte[] myFileData = File.ReadAllBytes(file);
    byte[] myHash = MD5.Create().ComputeHash(myFileData);
    string hashString = Convert.ToBase64String(myHash);

    return _computedHashes.Contains(hashString);
}

Presumably, you'll add the hash to the _computedHashes set after you've done the conversion.
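If you would rather register the hash at check time instead of after the conversion, the lookup and the insert can be folded into one call, since HashSet<T>.Add returns false when the value is already present. This is only a sketch of that variation, not the code above: the names ConversionCache and AlreadySeen are hypothetical, and it hashes from a stream instead of calling File.ReadAllBytes so that large media files are not loaded into memory in full.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public static class ConversionCache
{
    private static readonly HashSet<string> ComputedHashes = new HashSet<string>();

    // Returns true if a file with the same MD5 hash was already registered in this run.
    public static bool AlreadySeen(string file)
    {
        string hashString;
        using (MD5 md5 = MD5.Create())
        using (FileStream stream = File.OpenRead(file))
        {
            hashString = Convert.ToBase64String(md5.ComputeHash(stream));
        }

        // Add returns false when the hash is already in the set.
        return !ComputedHashes.Add(hashString);
    }
}
```

The trade-off is that a file counts as seen even if its conversion later fails, so this only fits if a failed conversion should not be retried within the same run.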

Upvotes: 1

MoustafaS

Reputation: 2031

  • Are you sure the new file format doesn't add extra metadata to the content, such as a last-modified time or other attributes that change?
  • Also, if you are converting to a known format, there should be a way to use the file signature to tell whether a file is already in that format. If it is your own format, add some extra signature bytes to identify it.
  • Don't forget that, with your approach, if your app is closed and reopened it will reprocess all files.
  • One last point regarding the code: I prefer not storing byte arrays, but if you must, it is better to create a HashSet instead of a List, since it has an access time of O(1). (For byte[] elements the set needs a custom IEqualityComparer<byte[]>, because arrays are compared by reference by default.)
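If the byte arrays are kept, a HashSet<byte[]> only behaves as intended when it is given a comparer that looks at array contents. A minimal sketch, assuming a hypothetical ByteArrayComparer class (nothing like it appears in the question's code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class ByteArrayComparer : IEqualityComparer<byte[]>
{
    // Two arrays are equal when they have the same bytes in the same order.
    public bool Equals(byte[] x, byte[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.SequenceEqual(y);
    }

    // Simple polynomial hash over the bytes; equal arrays yield equal codes.
    public int GetHashCode(byte[] bytes)
    {
        if (bytes == null) return 0;
        unchecked
        {
            int code = 17;
            foreach (byte b in bytes) code = code * 31 + b;
            return code;
        }
    }
}
```

The set would then be created as new HashSet<byte[]>(new ByteArrayComparer()), after which Contains and Add treat two separately allocated hashes with identical bytes as the same entry.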

Upvotes: 1
