Teja554

Reputation: 81

PowerShell to display duplicate files

I have a task to check whether new files were imported for the day into a shared folder, and to alert if there are any duplicate files. No recursive check is needed.

The code below displays the details, including size, of all files that are one day old. However, I need only the files with the same size, as I cannot compare them by name.

$Files = Get-ChildItem -Path E:\Script\test |
Where-Object {$_.CreationTime -gt (Get-Date).AddDays(-1)}

$Files | Select-Object -Property Name, hash, LastWriteTime, @{N='SizeInKb';E={[double]('{0:N2}' -f ($_.Length/1kb))}}

Upvotes: 8

Views: 19598

Answers (4)

Hashbrown

Reputation: 13023

I didn't like the big DOS-like script answer written here, so here's an idiomatic way of doing it in PowerShell:

From the folder in which you want to find duplicates, just run this simple pipeline:

Get-ChildItem -Recurse -File `
| Group-Object -Property Length `
| ?{ $_.Count -gt 1 } `
| %{ $_.Group } `
| Get-FileHash `
| Group-Object -Property Hash `
| ?{ $_.Count -gt 1 } `
| %{ $_.Group }

This will show all the files (and their hashes) whose content matches that of another file.
Each line does the following:

  • get files
    • from current directory (use -Path $directory otherwise)
    • recursively (if not wanted, remove -Recurse)
  • group based on file size
  • discard groups with less than 2 files
  • grab all those files
  • get hashes for each
  • group based on hash
  • discard groups with less than 2 files
  • get all those files

Add | %{ $_.path } to just show the paths instead of the hashes.
Add | %{ $_.path -replace "$([regex]::escape($(pwd)))",'' } to only show the relative path from the current directory (useful in recursion).

For the question-asker specifically, don't forget to whack in | Where-Object {$_.CreationTime -gt (Get-Date).AddDays(-1)} after the gci so you're not comparing files you don't want to consider, which might get very time-consuming if you have a lot of coincidentally same-length files in that shared folder.
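
Put together with the question's path and one-day filter (E:\Script\test and the CreationTime window are taken from the question; non-recursive, per its requirements), that would look something like:

Get-ChildItem -Path E:\Script\test -File `
| Where-Object { $_.CreationTime -gt (Get-Date).AddDays(-1) } `
| Group-Object -Property Length `
| ?{ $_.Count -gt 1 } `
| %{ $_.Group } `
| Get-FileHash `
| Group-Object -Property Hash `
| ?{ $_.Count -gt 1 } `
| %{ $_.Group }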

Finally, if you're like me and just want to find dupes based on name, as Google will probably take you here too:

gci -Recurse -file | Group-Object name | Where-Object { $_.Count -gt 1 } | select -ExpandProperty group | %{ $_.fullname }

Upvotes: 31

tukan

Reputation: 17347

All the examples here take into account only the timestamp, length and name. That is for sure not enough.

Imagine this example: you have two files, c:\test_path\test.txt and c:\test_path\temp\text.txt. The first one contains 12345; the second contains 54321. They have the same length, so these files will be considered identical even though they are not.

I have created a duplicate checker based on hash calculation. I wrote it just now off the top of my head, so it is rather crude (but I think you get the idea, and it will be easy to optimize):

Edit: I've decided the source code was "too crude" (a euphemism for incorrect) and I have improved it (removed superfluous code):

# The current directory where the script is executed
$path = (Resolve-Path .\).Path

$hash_details = @{}
$duplicities = @{}

# Remove records that are unique by size (different size = different hash)
# You can select only the files you need with e.g. "*.jpg"
$file_names = Get-ChildItem -Path $path -Recurse -Include "*.*" |
    ? { ! $_.PSIsContainer } |
    Group Length |
    ? { $_.Count -gt 1 } |
    Select -Expand Group |
    Select FullName, Length

# Using SHA256, since collisions have been found for SHA1
$hash_details = ForEach ($file in $file_names) {
    Get-FileHash -Path $file.FullName -Algorithm SHA256
}

# Counter used as the hash table key
$counter = 0
ForEach ($first_file_hash in $hash_details) {
    ForEach ($second_file_hash in $hash_details) {
        If (($first_file_hash.Hash -eq $second_file_hash.Hash) -and ($first_file_hash.Path -ne $second_file_hash.Path)) {
            $duplicities.Add($counter, $second_file_hash)
            $counter += 1
        }
    }
}

# Output the duplicate files
If ($duplicities.Count -gt 0) {
    Write-Output "Duplicate files found:" $duplicities.Values.Path
    $duplicities.Values | Out-File -Encoding UTF8 duplicate_log.txt
} Else {
    Write-Output 'No duplicities found'
}
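
As an aside, the pairwise loop above compares every hash against every other; the same result could be had more compactly by grouping on the hash, much like the pipeline answer here does. A minimal sketch of that variant, reusing $hash_details from the script above:

# Sketch only: group the already-computed hashes and keep the groups
# that contain more than one file
$duplicates = $hash_details |
    Group-Object -Property Hash |
    Where-Object { $_.Count -gt 1 } |
    Select-Object -ExpandProperty Group

If ($duplicates) {
    Write-Output "Duplicate files found:" $duplicates.Path
} Else {
    Write-Output 'No duplicities found'
}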

I have created a test structure:

PS C:\prg\PowerShell\_Snippets\_file_operations\duplicities> Get-ChildItem -path $path -Recurse


    Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities


Mode                LastWriteTime     Length Name
----                -------------     ------ ----
d----          9.4.2018      9:58            test
-a---          9.4.2018     11:06       2067 check_for_duplicities.ps1
-a---          9.4.2018     11:06        757 duplicate_log.txt


    Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test


Mode                LastWriteTime     Length Name
----                -------------     ------ ----
d----          9.4.2018      9:58            identical_file
d----          9.4.2018      9:56            t
-a---          9.4.2018      9:55          5 test.txt


    Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\identical_file


Mode                LastWriteTime     Length Name
----                -------------     ------ ----
-a---          9.4.2018      9:55          5 test.txt


    Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\t


Mode                LastWriteTime     Length Name
----                -------------     ------ ----
-a---          9.4.2018      9:55          5 test.txt

(where the file in ..\duplicities\test\t is different from the others).

The result of running the script:

The console output:

PS C:\prg\PowerShell\_Snippets\_file_operations\duplicities> .\check_for_duplicities.ps1
Duplicate files found:
C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\identical_file\test.txt
C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\test.txt

The duplicate_log.txt file contains more detailed information:

Algorithm       Hash                                                                   Path                                                                                              
---------       ----                                                                   ----                                                                                              
SHA256          5994471ABB01112AFCC18159F6CC74B4F511B99806DA59B3CAF5A9C173CACFC5       C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\identical_file\test.txt             
SHA256          5994471ABB01112AFCC18159F6CC74B4F511B99806DA59B3CAF5A9C173CACFC5       C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\test.txt                            

Conclusion

As you can see, the differing file is correctly omitted from the result set.

Upvotes: 2

DKU

Reputation: 75

This might be helpful for you.

# Group files created in the last day by size, then open every file
# whose size matches at least one other file
$files = Get-ChildItem 'E:\SC' | Where-Object { $_.CreationTime -gt (Get-Date).AddDays(-1) } | Group-Object -Property Length

foreach ($filegroup in $files)
{
    if ($filegroup.Count -ne 1)
    {
        foreach ($file in $filegroup.Group)
        {
            Invoke-Item $file.FullName
        }
    }
}

Upvotes: -1

postanote

Reputation: 16096

Since it is the file contents that determine whether files are duplicates, it's more prudent to just hash the files and compare the hashes.

The name, size and timestamp would not be prudent attributes for the defined use case, since only the hash tells you whether the files have the same content.

See these discussions

Need a way to check if two files are the same? Calculate a hash of the files. Here is one way to do it: https://blogs.msdn.microsoft.com/powershell/2006/04/25/duplicate-files
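
For instance, a minimal sketch of that idea using the built-in Get-FileHash cmdlet (the two paths are hypothetical placeholders):

# Compare two files by content hash (paths are hypothetical examples)
$first  = Get-FileHash -Path 'C:\data\a.txt' -Algorithm SHA256
$second = Get-FileHash -Path 'C:\data\b.txt' -Algorithm SHA256

if ($first.Hash -eq $second.Hash) {
    'The files have identical content'
}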

Duplicate File Finder and Remover

And now the moment you have been waiting for....an all PowerShell file duplicate finder and remover! Now you can clean up all those copies of pictures, music files, and videos. The script opens a file dialog box to select the target folder, recursively scans each file for duplicates…

https://gallery.technet.microsoft.com/scriptcenter/Duplicate-File-Finder-and-78f40ae9

Upvotes: -1
