Dave

Reputation: 8461

What pattern can I use to ensure I don't perform the same action twice?

My program is looking for duplicates. It compares a file to every other file in the folder and subfolders. The problem is, it performs each check twice.

For example, please consider the following (crude) folder structure

-Folder1
---File1
---File2
---File3

-Folder2
---File1
---File2

-Folder3
---File1
---File2
---File3
---File4

So, just to ensure clarity: Folder1, Folder2 and Folder3 are all at root level, and the files listed under each one live inside that folder.

My program iterates through, comparing each file to every other file via two nested foreach loops.

 foreach (string path01 in Directory.GetFiles(SourcePath, "*.*", SearchOption.AllDirectories))
 {
     foreach (string path02 in Directory.GetFiles(SourcePath, "*.*", SearchOption.AllDirectories))
     {
         //perform logic with path01 and path02
     }
 }

Now, the problem with this is that one of the iterations will compare Folder1\File1 to Folder2\File1 (which is desired), but it will also compare Folder2\File1 to Folder1\File1. This is inefficient, as that check has already been done. I admit that with only the files/folders listed above it could be argued that it hardly matters, but my application is comparing thousands of folders and I have no idea how many files.

In my head, I think I have to order the files alphabetically, then use a for loop that always starts at the next index, so the search never goes backwards, but I'm not sure. At one point I thought a bubble sort could help, but this isn't really about sorting, so I'm not sure whether that applies.

I am certain that this type of problem is documented and has a solution; the problem I'm having (as you can tell by the length of my post) is how to describe it in a Google search so I can research whether a pattern exists.

So, my question is: does a pattern or paradigm already exist for such a problem?

Upvotes: 2

Views: 104

Answers (2)

Moha Dehghan

Reputation: 18443

var files = Directory.GetFiles(SourcePath, "*.*", SearchOption.AllDirectories);

for (int i = 0; i < files.Length - 1; i++)
{
    // Start j at i + 1 so each unordered pair is visited exactly once.
    for (int j = i + 1; j < files.Length; j++)
    {
        string path1 = files[i];
        string path2 = files[j];
        //perform logic with path1 and path2
    }
}

This code performs better than your code in two ways:

  1. It does not compare each pair of files twice, which was your concern.
  2. It calls Directory.GetFiles only once.
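
To put rough numbers on the first point: with n files, the nested foreach approach makes n^2 comparisons (it even compares each file to itself), while the triangular loop above makes n(n-1)/2. For 1,000 files that is 1,000,000 comparisons versus 499,500.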

Upvotes: 1

Ayush

Reputation: 42450

How are you detecting duplicates? Are you only looking for a duplicate file name, or do you open the file and read its contents as well? Either way, you should use a HashSet:

var visitedFiles = new HashSet<string>();

using (var md5 = System.Security.Cryptography.MD5.Create())
{
    foreach (string path01 in Directory.GetFiles(SourcePath, "*.*", SearchOption.AllDirectories))
    {
        // Hash the file contents so duplicates are detected by content, not by name.
        byte[] contents = File.ReadAllBytes(path01);
        string contentHash = Convert.ToBase64String(md5.ComputeHash(contents));

        if (!visitedFiles.Contains(contentHash))
        {
            visitedFiles.Add(contentHash);
        }
        else
        {
            // duplicate file found
        }
    }
}

This is a basic, untested example. You can modify it according to your needs. Instead of storing strings in the HashSet, you could store a class object that holds more information (customize it to your needs).
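
For instance, here is a minimal, equally untested sketch of that idea. It assumes you want to report the paths of all duplicates rather than just detect them, so it swaps the HashSet for a Dictionary keyed by content hash:

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Map each content hash to every file path that has it.
var filesByHash = new Dictionary<string, List<string>>();

using (var md5 = MD5.Create())
{
    foreach (string path in Directory.GetFiles(SourcePath, "*.*", SearchOption.AllDirectories))
    {
        string hash = Convert.ToBase64String(md5.ComputeHash(File.ReadAllBytes(path)));

        List<string> group;
        if (!filesByHash.TryGetValue(hash, out group))
        {
            group = new List<string>();
            filesByHash[hash] = group;
        }
        group.Add(path);
    }
}

// Any hash with more than one path is a group of duplicate files.
foreach (var group in filesByHash.Values)
{
    if (group.Count > 1)
    {
        Console.WriteLine(string.Join(", ", group));
    }
}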

Anyway, this solution has a time complexity of O(n) (each file is hashed and looked up once), as opposed to yours, which is O(n^2).

Upvotes: 2
