Reputation: 47
I have a PowerShell script (see below) that locates any duplicates of the files in an input path that exist outside of that path and, if any are found, e-mails an attachment with the details.
It worked on my personal machine, and I am currently testing it on our server. I never expected it to be fast, but I am now an hour into the test and it still has not finished!
My question is then, is there anything I can do to reduce the time it takes to run?
As an additional question: I encrypted the password file via PowerShell... is it possible for someone who has access to the file to decrypt it and view the password in plain text?
Any help would be appreciated!
$sourcepath = "\\server1\privatetest\"
$duplicatepath = "\\server1\public\"
$dup_found = 0

function Send-ToEmail([string]$email, [string]$attachmentpath){
    $message = new-object Net.Mail.MailMessage;
    $message.From = "[email protected]";
    $message.To.Add($email);
    $message.Subject = "Duplicate Found";
    $message.Body = "Please see attachment";
    $attachment = New-Object Net.Mail.Attachment($attachmentpath);
    $message.Attachments.Add($attachment);
    $smtp = new-object Net.Mail.SmtpClient("smtp.gmail.com", "587");
    $smtp.EnableSSL = $true;
    $smtp.Credentials = New-Object System.Net.NetworkCredential($Username, $Password);
    $smtp.send($message);
    $attachment.Dispose();
}

If ((Test-Path $sourcepath) -AND (Test-Path $duplicatepath)) {
    $sourcefiles = Get-ChildItem $sourcepath -File -Recurse -ErrorAction SilentlyContinue | Get-FileHash
    $dupfiles = Get-ChildItem $duplicatepath -File -Recurse -ErrorAction SilentlyContinue | Get-FileHash
    $duplicates = [System.Collections.ArrayList]@()
    If (($sourcefiles.count -eq 0) -or ($dupfiles.count -eq 0)) {
        If ($sourcefiles.count -eq 0) {
            Write-Warning 'No files found in source path'
        }
        else {
            Write-Warning 'No files found in duplicate path'
        }
        Break
    }
    else {
        foreach ($sf in $sourcefiles) {
            $result1path = $sf | Select -Property Path
            $result1hash = $sf | Select -Property Hash
            foreach ($df in $dupfiles) {
                $result2path = $df | Select -Property Path
                $result2hash = $df | Select -Property Hash
                If (($result1hash) -like ($result2hash)) {
                    $dup_found = 1
                    $dupmsg = 'Source Path: '
                    $dupmsg = $dupmsg + $result1path
                    $dupmsg = $dupmsg + ', Source Hash: '
                    $dupmsg = $dupmsg + $result1hash
                    $dupmsg = $dupmsg + ', Duplicate Path: '
                    $dupmsg = $dupmsg + $result2path
                    $dupmsg = $dupmsg + ', Duplicate Hash: '
                    $dupmsg = $dupmsg + $result2hash
                    $duplicates = $duplicates + $dupmsg
                }
            }
        }
        if ($dup_found -eq 1) {
            $Username = "[email protected]";
            $pwfile = Get-Content "PasswordFile"
            $Password = $pwfile | ConvertTo-SecureString
            $path = "C:\temp\duplicates.txt";
            $duplicates | Out-File -FilePath C:\temp\duplicates.txt
            Send-ToEmail -email "[email protected]" -attachmentpath $path;
            Remove-Item C:\temp\duplicates.txt
        }
    }
}
else {
    If(!(Test-Path $sourcepath)) {
        Write-Warning 'Source path not found'
    }
    elseif(!(Test-Path $duplicatepath)) {
        Write-Warning 'Duplicate path not found'
    }
}
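For reference, a password file that Get-Content | ConvertTo-SecureString can read back the way the script does is typically created along these lines (a sketch of the usual DPAPI-based pattern, not necessarily my exact command):

# Typical way to create "PasswordFile" - ConvertFrom-SecureString (without -Key)
# protects the string with DPAPI for the current user on the current machine
Read-Host -Prompt 'SMTP password' -AsSecureString |
    ConvertFrom-SecureString |
    Out-File "PasswordFile"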
Upvotes: 1
Views: 436
Reputation: 174690
[...], is there anything I can do to reduce the time it takes to run?
Yes, there most certainly is!
What you have here is a classic performance gotcha - by comparing each file in one collection against every file in the other, you've created a quadratic algorithm.
What does quadratic mean? It means that for N input items in each collection, you have to perform N^2 comparisons - with 1 file in each directory you only need 1 comparison, but with 2 files you need 4, with 3 files 9, and so on. At just 100 files in each directory, you're already making 10,000(!) comparisons.
Instead, you'll want to use a data structure that's fast at determining whether a specific value is contained within it or not. For this purpose, you could use a hash table:
# Create a hashtable
$sourceFileIndex = @{}

# Use the source files to populate the hashtable - we'll use the calculated hash as the key
Get-ChildItem $sourcepath -File -Recurse -ErrorAction SilentlyContinue |ForEach-Object {
    $hashed = $_ |Get-FileHash
    $sourceFileIndex[$hashed.Hash] = $hashed
}
# Keep the potential duplicates in an array, no need to change anything here
$dupfiles = Get-ChildItem $duplicatepath -File -Recurse -ErrorAction SilentlyContinue | Get-FileHash
#...
# Now we can remove the outer loop completely
foreach ($df in $dupfiles) {
    # Here's the magic - replace the string comparison with a call to ContainsKey()
    if ($sourceFileIndex.ContainsKey($df.Hash)) {
        $dup_found = 1
        # Look up the matching source file by its hash
        $sourceMatch = $sourceFileIndex[$df.Hash]
        $dupmsg = 'Source Path: '
        $dupmsg = $dupmsg + $sourceMatch.Path
        $dupmsg = $dupmsg + ', Source Hash: '
        $dupmsg = $dupmsg + $sourceMatch.Hash
        $dupmsg = $dupmsg + ', Duplicate Path: '
        $dupmsg = $dupmsg + $df.Path
        $dupmsg = $dupmsg + ', Duplicate Hash: '
        $dupmsg = $dupmsg + $df.Hash
        $duplicates = $duplicates + $dupmsg
    }
}
This should already give you a massive performance boost.
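If you want to put a number on the difference, you could time each variant with the built-in Measure-Command cmdlet (a rough sketch - the script block contents stand in for whichever version of the comparison logic you're measuring):

# Rough timing comparison - wrap each version of the comparison logic in a script block
$nested  = Measure-Command { <# original nested-loop comparison goes here #> }
$indexed = Measure-Command { <# hashtable-based comparison goes here #> }
'Nested loops: {0:n1}s, hashtable lookup: {1:n1}s' -f $nested.TotalSeconds, $indexed.TotalSeconds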
Another costly aspect of your current approach (although not as significant as the problem described above) is the repeated string concatenation - the runtime has to re-allocate memory for every little intermediate string, and that eventually takes a toll on execution time when processing large volumes of data.
One way to reduce string manipulation is by creating structured objects instead of maintaining a running "output string":
foreach ($df in $dupfiles) {
    # Here's the magic - replace the string comparison with a call to ContainsKey()
    if ($sourceFileIndex.ContainsKey($df.Hash)) {
        $dup_found = 1

        # Create output object
        $dupeRecord = [pscustomobject]@{
            SourcePath    = $sourceFileIndex[$df.Hash].Path
            SourceHash    = $df.Hash # these are identical, no need to fetch the "source hash"
            DuplicatePath = $df.Path
            DuplicateHash = $df.Hash
        }
        [void]$duplicates.Add($dupeRecord)
    }
}
This brings about another improvement! Since these are objects (as opposed to raw strings), you now have greater choice/flexibility when it comes to output formatting:
# Want an HTML table? Go ahead!
$duplicates |ConvertTo-Html -As Table |Out-File .\path\to\attachment.html
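Or, if the recipient would rather open the results in Excel, the same objects can go straight to CSV (the file name here is just an example):

# Prefer CSV? The same objects work unchanged
$duplicates |Export-Csv -Path .\path\to\attachment.csv -NoTypeInformation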
Upvotes: 4