Reputation: 80
I am a PowerShell noob looking for a way to find duplicate files in a directory and write the file paths of the duplicates to a text file or CSV file. My current code works, but it is extremely inefficient and slow. Any recommendations would be greatly appreciated.
#Declaring the array to store file objects (paths and names)
$arr = (get-childitem "My Path" -recurse | where {$_.extension -like '*.*'})
#Creating an array to hold the indices of duplicates already found, so they can be skipped in the iteration
$arrDupNum = @(-1)
#Outer loop iterates over every file in the array
For ($i = 0; $i -le $arr.Length - 1; $i++) {
    $percent = $i / $arr.Length * 100
    Write-Progress -Activity "ActivityString" -Status "StatusString" -PercentComplete $percent -CurrentOperation "CurrentOperationString"
    $trigger = "f"
    #Inner loop compares the current file against every later file
    For ($j = $i + 1; $j -le $arr.Length - 1; $j++)
    {
        foreach ($num in $arrDupNum)
        {
            #if statement to skip over duplicates already found
            if ($num -eq $j -and $j -le $arr.Length - 2)
            {
                $j = $j + 1
            }
        }
        if ($arr[$j].Name -eq $arr[$i].Name)
        {
            $trigger = "t"
            Add-Content H:\Desktop\blank.txt ($arr[$j].FullName + "; " + $arr[$i].FullName)
            Write-Host $arr[$i].Name
            #Record the index so it is skipped in later passes
            $arrDupNum += $j
        }
    }
    #trigger used for inserting a blank separator row, keeping the text file in CSV format
    if ($trigger -eq "t")
    {
        Add-Content H:\Desktop\blank.txt (" " + "; " + " ")
    }
}
Upvotes: 1
Views: 5160
Reputation: 16606
The other answer tackles the most significant improvement you can make, but there are a couple of other tweaks that might improve performance.
When you use Where-Object to filter by the Extension property, that filtering is done in PowerShell itself. For a simple pattern like the one you're using, you can have a lower-level API do the filtering instead, using the -Filter parameter of Get-ChildItem...
$arr = (get-childitem "My Path" -recurse -Filter '*.*')
That pattern, of course, specifically filters for entries whose names contain a '.' character. If you meant it as a DOS-style "all files" pattern, you could use '*' or, better yet, just omit the filter entirely. On the subject of "all files", it's important to point out that Get-ChildItem does not include hidden files by default. To include those in your search, use the -Force parameter...
$arr = (get-childitem "My Path" -recurse -Filter '*.*' -Force)
Also, be aware that Get-ChildItem will return both file and directory objects from a filesystem. That is, the code in the question will look at directory names, too, in its search for duplicates. If, as the question suggests, you want to restrict it to files, you can use the -File parameter of Get-ChildItem...
$arr = (get-childitem "My Path" -recurse -Filter '*.*' -File)
Note that the -File parameter first became available in PowerShell 3.0, but as that is several versions old, I'm sure it will work for you.
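If you're not sure which version you're running, you can check the built-in $PSVersionTable automatic variable. A minimal sketch:
# Display the running PowerShell version; -File requires 3.0 or later
$PSVersionTable.PSVersion
if ($PSVersionTable.PSVersion.Major -lt 3) {
    Write-Warning "PowerShell 3.0+ is required for the -File parameter"
}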
Upvotes: 1
Reputation: 174485
Use a hashtable to group the files by name:
$filesByName = @{}
foreach ($file in $arr) {
    $filesByName[$file.Name] += @($file)
}
Now we just need to find all hashtable entries with more than one file:
foreach ($fileName in $filesByName.Keys) {
    if ($filesByName[$fileName].Count -gt 1) {
        # Duplicates found!
        $filesByName[$fileName] | Select-Object -ExpandProperty FullName | Add-Content .\duplicates.txt
    }
}
This way, when you have N files, you'll iterate over them at most N*2 times, instead of N*N times :)
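Since you mentioned CSV output as an option: here's a minimal sketch of the same grouping idea using the built-in Group-Object and Export-Csv cmdlets. It assumes, as above, that two files count as duplicates when their names match, and the output path .\duplicates.csv is just a placeholder:
# Group all files by name, keep only groups with more than one file,
# then write one CSV row per duplicated name
Get-ChildItem "My Path" -Recurse -File |
    Group-Object -Property Name |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object {
        [pscustomobject]@{
            Name  = $_.Name
            Paths = ($_.Group.FullName -join '; ')
        }
    } |
    Export-Csv .\duplicates.csv -NoTypeInformation
Group-Object does roughly the same work as the manual hashtable with a bit more overhead, so for very large directory trees the hashtable version above will be faster.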
Upvotes: 2