Reputation: 81
I have a task to check whether new files were imported into a shared folder for the day, and to alert on any duplicate files; no recursive check is needed.
The code below displays all files that are at most one day old, with their size. However, I need only the files that share the same size, since I cannot compare them by name.
$Files = Get-ChildItem -Path E:\Script\test |
Where-Object {$_.CreationTime -gt (Get-Date).AddDays(-1)}
$Files | Select-Object -Property Name, @{N='Hash';E={(Get-FileHash -Path $_.FullName).Hash}}, LastWriteTime, @{N='SizeInKb';E={[double]('{0:N2}' -f ($_.Length/1kb))}}
Upvotes: 8
Views: 19598
Reputation: 13023
I didn't like the big DOS-like script answer written here, so here's an idiomatic way of doing it in PowerShell:
From the folder where you want to find the duplicates, just run this simple set of pipes
Get-ChildItem -Recurse -File `
| Group-Object -Property Length `
| ?{ $_.Count -gt 1 } `
| %{ $_.Group } `
| Get-FileHash `
| Group-Object -Property Hash `
| ?{ $_.Count -gt 1 } `
| %{ $_.Group }
This will show all files, with their hashes, whose content matches that of other files.
Each line does the following: get every file under the current directory, recursing into subdirectories with -Recurse (pass -Path $directory to start elsewhere); group the files by length; keep only the groups with more than one member; expand those groups back into files; hash each remaining candidate; group by hash; keep only the hash groups with more than one member; and expand them into the final list.
Add | %{ $_.path }
to just show the paths instead of the hashes.
Add | %{ $_.path -replace "$([regex]::escape($(pwd)))",'' }
to only show the relative path from the current directory (useful in recursion).
For the question-asker specifically, don't forget to whack in | Where-Object {$_.CreationTime -gt (Get-Date).AddDays(-1)}
right after the gci,
so you're not comparing files you don't want to consider, which could get very time-consuming if that shared folder contains a lot of coincidentally same-length files.
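Putting those two pieces together for the asker's scenario, a minimal sketch (the folder path comes from the question; -File assumes PowerShell 3 or later, and no -Recurse since the question doesn't need it):

```powershell
# Non-recursive duplicate check limited to files created in the last day.
Get-ChildItem -Path E:\Script\test -File |
    Where-Object { $_.CreationTime -gt (Get-Date).AddDays(-1) } |
    Group-Object -Property Length |        # same length is a cheap pre-filter
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Group } |
    Get-FileHash |                         # hash only the size-collision candidates
    Group-Object -Property Hash |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Group.Path }       # emit just the duplicate paths
```

Grouping by length first keeps the expensive hashing to the handful of files that could possibly be duplicates.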
Finally, if you're like me and just want to find dupes based on name, as Google will probably take you here too:
gci -Recurse -file | Group-Object name | Where-Object { $_.Count -gt 1 } | select -ExpandProperty group | %{ $_.fullname }
Upvotes: 31
Reputation: 17347
All the examples here take into account only timestamp, length, and name. That is certainly not enough.
Imagine this example. You have two files: c:\test_path\test.txt and c:\test_path\temp\text.txt. The first one contains 12345 and the second contains 54321. They have the same length, so a size-only check will consider these files identical even though they are not.
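The collision is easy to reproduce; a quick sketch to run in a scratch directory (-NoNewline needs PowerShell 5 or later, and the file names are illustrative):

```powershell
# Two files of identical length (5 bytes) but different content.
Set-Content -Path test.txt -Value '12345' -NoNewline
Set-Content -Path text.txt -Value '54321' -NoNewline

(Get-Item test.txt).Length -eq (Get-Item text.txt).Length      # True: same size
(Get-FileHash test.txt).Hash -eq (Get-FileHash text.txt).Hash  # False: different content
```

A size comparison reports a match; only the hash comparison reveals the difference.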
I have created a duplicate checker based on hash calculation. I wrote it off the top of my head, so it is rather crude (but I think you get the idea and it will be easy to optimize):
Edit: I've decided the source code was "too crude" (a nickname for incorrect) and I have improved it (removed superfluous code):
# The current directory where the script is executed
$path = (Resolve-Path .\).Path
$hash_details = @{}
$duplicities = @{}
# Discard files whose size is unique (a different size guarantees a different hash)
# You can select only those you need with e.g. "*.jpg"
$file_names = Get-ChildItem -path $path -Recurse -Include "*.*" | ? {( ! $_.PSIsContainer)} | Group Length | ? {$_.Count -gt 1} | Select -Expand Group | Select FullName, Length
# I'm using SHA256 due to SHA1 collisions found
$hash_details = ForEach ($file in $file_names) {
Get-FileHash -Path $file.Fullname -Algorithm SHA256
}
# just counter for the Hash table key
$counter = 0
ForEach ($first_file_hash in $hash_details) {
ForEach ($second_file_hash in $hash_details) {
If (($first_file_hash.hash -eq $second_file_hash.hash) -and ($first_file_hash.path -ne $second_file_hash.path)) {
$duplicities.add($counter, $second_file_hash)
$counter += 1
}
}
}
# Output the duplicate files, if any
If ($duplicities.count -gt 0) {
Write-Output "Duplicate files found:" $duplicities.values.Path
$duplicities.values | Out-File -Encoding UTF8 duplicate_log.txt
} Else {
Write-Output 'No duplicates found'
}
I have created a test structure:
PS C:\prg\PowerShell\_Snippets\_file_operations\duplicities> Get-ChildItem -path $path -Recurse
Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities
Mode LastWriteTime Length Name
---- ------------- ------ ----
d---- 9.4.2018 9:58 test
-a--- 9.4.2018 11:06 2067 check_for_duplicities.ps1
-a--- 9.4.2018 11:06 757 duplicate_log.txt
Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test
Mode LastWriteTime Length Name
---- ------------- ------ ----
d---- 9.4.2018 9:58 identical_file
d---- 9.4.2018 9:56 t
-a--- 9.4.2018 9:55 5 test.txt
Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\identical_file
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 9.4.2018 9:55 5 test.txt
Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\t
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 9.4.2018 9:55 5 test.txt
(The file in ..\duplicities\test\t differs from the others.)
The result of running the script:
The console output:
PS C:\prg\PowerShell\_Snippets\_file_operations\duplicities> .\check_for_duplicities.ps1
Duplicate files found:
C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\identical_file\test.txt
C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\test.txt
The duplicate_log.txt file contains more detailed information:
Algorithm Hash Path
--------- ---- ----
SHA256 5994471ABB01112AFCC18159F6CC74B4F511B99806DA59B3CAF5A9C173CACFC5 C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\identical_file\test.txt
SHA256 5994471ABB01112AFCC18159F6CC74B4F511B99806DA59B3CAF5A9C173CACFC5 C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\test.txt
Conclusion
As you can see, the differing file is correctly omitted from the result set.
Upvotes: 2
Reputation: 75
This might be helpful for you.
$filegroups = Get-ChildItem 'E:\SC' | Where-Object {$_.CreationTime -gt (Get-Date).AddDays(-1)} | Group-Object -Property Length
foreach ($filegroup in $filegroups)
{
    if ($filegroup.Count -ne 1)
    {
        foreach ($file in $filegroup.Group)
        {
            Invoke-Item $file.FullName
        }
    }
}
Upvotes: -1
Reputation: 16096
Since it's the file contents that determine whether two files are duplicates, it's more prudent to just hash the files and compare the hashes.
Name, size, and timestamp are not prudent attributes for the defined use case; only the hash tells you whether the files have the same content.
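As a minimal sketch of that idea, comparing two files by hash takes only a few lines (the file names here are placeholders):

```powershell
# Get-FileHash uses SHA256 by default; identical content yields identical
# hashes regardless of name, size formatting, or timestamp.
$a = Get-FileHash -Path .\copy1.bin
$b = Get-FileHash -Path .\copy2.bin
if ($a.Hash -eq $b.Hash) { 'Files are identical' } else { 'Files differ' }
```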
See these discussions
Need a way to check if two files are the same? Calculate a hash of the files. Here is one way to do it: https://blogs.msdn.microsoft.com/powershell/2006/04/25/duplicate-files
Duplicate File Finder and Remover
And now the moment you have been waiting for... an all-PowerShell file duplicate finder and remover! Now you can clean up all those copies of pictures, music files, and videos. The script opens a file dialog box to select the target folder, then recursively scans each file for duplicates.
https://gallery.technet.microsoft.com/scriptcenter/Duplicate-File-Finder-and-78f40ae9
Upvotes: -1