bman

Reputation: 80

Finding duplicate file names in PowerShell

I am a PowerShell noob looking for a way to find duplicate files in a directory and write the file paths to a text or CSV file. My current code works, but it is extremely inefficient and slow. Any recommendations would be greatly appreciated.

#Declaring the Array to store file paths and names
$arr = (get-childitem "My Path" -recurse | where {$_.extension -like '*.*'})

#Creating an array to hold the indexes of duplicates already found, so they can be skipped during iteration
$arrDupNum = @(-1)

#Declaring for loop to iterate over the array
For ($i=0; $i -le $arr.Length - 1; $i++) {
    $percent = $i / $arr.Length * 100
    Write-Progress -Activity "ActivityString" -Status "StatusString" -PercentComplete $percent -CurrentOperation "CurrentOperationString"
    
    $trigger = "f"
    
    For ($j = $i + 1; $j -le $arr.Length - 1; $j++)
    {
        foreach ($num in $arrDupNum)
        {
            #if statement to skip over duplicates already found
            if($num -eq $j -and $j -le $arr.Length - 2)
            {
                $j = $j + 1
            }            
        }

        if ($arr[$j].Name -eq $arr[$i].Name)
        {
            $trigger = "t"
            Add-Content H:\Desktop\blank.txt ($arr[$j].FullName + "; " + $arr[$i].FullName)
            Write-Host $arr[$i].Name
            $arrDupNum += $j
        }
    }
    #Trigger used for formatting the text file in CSV format
    if ($trigger -eq "t")
    {
        Add-Content H:\Desktop\blank.txt (" " + "; " + " ")
    }
}

Upvotes: 1

Views: 5160

Answers (2)

Lance U. Matthews

Reputation: 16606

The other answer tackles the most significant improvement you can make, but there are a couple of other tweaks that might improve performance.

When you use Where-Object to filter by the Extension property, that filtering is done in PowerShell itself. For a simple pattern like you're using, you can have a lower-level API do the filtering using the -Filter parameter of Get-ChildItem...

$arr = (get-childitem "My Path" -recurse -Filter '*.*')

That pattern, of course, specifically filters for entries whose names contain a . character. If you meant it as a DOS-style "all files" pattern, you could use '*' or, better yet, just omit the filter entirely. On the subject of "all files", it's important to point out that Get-ChildItem does not include hidden files by default. To include those in your search, use the -Force parameter...

$arr = (get-childitem "My Path" -recurse -Filter '*.*' -Force)

Also, be aware that Get-ChildItem will return both file and directory objects from a filesystem. That is, the code in the question will look at directory names, too, in its search for duplicates. If, as the question suggests, you want to restrict it to files you can use the -File parameter of Get-ChildItem...

$arr = (get-childitem "My Path" -recurse -Filter '*.*' -File)

Note that the -File parameter first became available in PowerShell 3.0, but as that is several versions old I'm sure it will work for you.
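If you ever do need to run on an older version, a rough equivalent (just a sketch, filtering on the PSIsContainer property, which is $true for directory objects) would be:

# Pre-3.0 fallback: -File isn't available, so exclude directories manually
$arr = (get-childitem "My Path" -recurse -Filter '*.*' -Force | where { -not $_.PSIsContainer })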

Upvotes: 1

Mathias R. Jessen

Reputation: 174485

Use a hashtable to group the files by name:

$filesByName = @{}

foreach($file in $arr){
    $filesByName[$file.Name] += @($file)
}

Now we just need to find all hashtable entries with more than one file:

foreach($fileName in $filesByName.Keys){
    if($filesByName[$fileName].Count -gt 1){
        # Duplicates found!
        $filesByName[$fileName] | Select-Object -ExpandProperty FullName | Add-Content .\duplicates.txt
    }
}
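Since the question mentions writing to a text or CSV file, here's a minimal variation of that loop (the Name/FullName column names are just my choice) that produces a proper CSV via Export-Csv instead:

# Emit one row per duplicate file; Export-Csv handles quoting and headers
$filesByName.Keys | Where-Object { $filesByName[$_].Count -gt 1 } | ForEach-Object {
    foreach($file in $filesByName[$_]){
        [pscustomobject]@{ Name = $file.Name; FullName = $file.FullName }
    }
} | Export-Csv .\duplicates.csv -NoTypeInformation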

This way, when you have N files, you'll iterate over them at most 2*N times, instead of N*N times :)
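As an aside, Group-Object can do the same name-based grouping in a single pipeline; this is just a sketch over the same $arr, and not necessarily faster than the explicit hashtable:

# Group-Object builds the name-to-files mapping internally;
# any group with more than one member is a set of duplicates
$arr | Group-Object Name | Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Group | Select-Object -ExpandProperty FullName } |
    Add-Content .\duplicates.txt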

Upvotes: 2
