user1003713

Reputation: 21

Searching Thousands of Files in a Directory using a CSV List

I'm using PowerShell 5.0 and have a .CSV file with a list of siebelId values (approximately 5,000). I want to search through each folder and subfolder on a server for any file whose name contains one of those IDs, e.g. 32444167.pdf or 32444167.pdf.metadata.properties.xml.

Example CSV file:

32444167,ACME,4/15/2013
27721071,ACME,4/15/2013
27721072,ACME,4/15/2013

I am filtering on *.PDF and *.XML. Then I want to copy the found files to a destination folder on the same server. The problem is that I have hundreds of thousands of files in the folder and its subfolders. The code I wrote takes a long time to run, up to several days. I am not an expert and believe I have not written the most efficient PowerShell script. Any help would be appreciated.

Basically, the code works, but it is extremely slow when processing a folder that has hundreds of thousands of files. It seems inefficient to call Get-ChildItem each time I get a new item from the list.

$PDFExtension = '.pdf'
$XMLExtension = '.pdf.metadata.properties.xml'
$source = 'C:\Temp\CSVtoXML'
$destination = 'C:\Temp\FindFiles\'                                                           #' 
$strGetDate = get-date -UFormat "%Y-%m-%d %H:%M:%S"
$log = $destination + "FileCopyLog.txt"

$FileList = import-csv "C:\Temp\FindFiles\test.csv" -Delimiter "," -Header 'siebelId', 'companyCode', 'receivedDate'
$GetFiles = @(Get-ChildItem -path $source -Recurse -File -include *.xml, *.pdf ) | select -First 100000

ForEach ($item in $FileList){
 $siebelId = $($item.siebelId) + $PDFExtension
 $XMLFile = $($item.siebelId) + $XMLExtension

 $FilterFiles = @($GetFiles) | Where-Object {$_.name -eq $siebelId -or $_.name -eq $XMLFile} #|  Out-File $destination"FileCopyLog.csv"
 #write-host "Filtered Files: " $FilterFiles

 ForEach ($file in $FilterFiles){

   $fileBase = $file.BaseName
   $fileExt = $file.Extension

   write-host "file: " $fileBase$fileExt

   If (-not ([string]::IsNullOrEmpty($file))) {
       if(!(Test-Path -Path $Destination$fileBase$fileExt)) {
            copy-item $file -destination $destination   # Copies files
            write-host "File: [" $file "] has Been Copied! to " $Destination `n`r -ForegroundColor yellow
            $strGetDate = get-date -UFormat "%Y-%m-%d %H:%M:%S"
            $LogValue = $strGetDate + ': ' + "Source: [" + $file + "] Destination: " + $Destination
            Add-Content -Path $log -Value $LogValue
       } else
       {
            write-host "File: [" $file "] already exists in destination folder" `n`r -ForegroundColor yellow
            $strGetDate = get-date -UFormat "%Y-%m-%d %H:%M:%S"
            $LogValue = $strGetDate + ': ' + "File: [" + $file + "] already exists in destination folder! "
            Add-Content -Path $log -Value $LogValue 
       }

   }else{
       write-host "No File was copied!" `n`r -ForegroundColor red
   }
 }
}

write-host 'Script has completed' -ForegroundColor green


The result I'm looking for is to have this process finish within a couple of hours rather than several days.

Upvotes: 1

Views: 1013

Answers (3)

Ben Personick

Reputation: 3264

Instead of looping the files, filter them.

Amended to use ".pdf.metadata.properties.xml" instead of XML, and to match those files by stripping '.pdf.metadata.properties' from the BaseName of the files found.

edit

Also put in more of your script to reduce the time taken in the copy process, by generating a list of destination files and then filtering the files we'll copy by fi



$Exts =@('.pdf','.pdf.metadata.properties.xml')

$source = 'C:\Temp\CSVtoXML'
$destination = 'C:\Temp\FindFiles\'                                                           #' 
$strGetDate = get-date -UFormat "%Y-%m-%d %H:%M:%S"
$log = "$($destination)FileCopyLog.txt"

$SiebelIDFile="$($destination)test.csv"
$SiebelIDImport = import-csv $SiebelIDFile -Delimiter "," -Header 'siebelId', 'companyCode', 'receivedDate'

# -Filter needs a wildcard; a bare '.pdf' would only match a file named exactly ".pdf"
$SRC_Matched_Exts = $(  $Exts | % { Get-ChildItem -path $source -Recurse -File -Filter "*$_"  } )


# Presto we can filter the list using the Siebel IDs


$Results = $SRC_Matched_Exts | ? { ($_.BaseName -replace '\.pdf\.metadata\.properties$','') -in $SiebelIDImport.siebelId }

# Confirm results by outputting the first 100
$Results | select -first 100 | FT -property BaseName, FullName -Auto 

# Get Destination Files to compare:
$Dst_Matched_Exts = $(  $Exts | % { Get-ChildItem -path $destination -Recurse -File -Filter "*$_"  } )

# Filter to only the Source files notin the destination:
$Src_Files_MissingFromDst = $Results | ? { $_.basename -notin $( $Dst_Matched_Exts.basename ) }
$Src_Files_AlreadyInDst = $Results | ? { $_.basename -notin $Src_Files_MissingFromDst.basename }


# Output some of the Files we won't Copy because they already exist in dst:
Write-host "
 Output some of the Files we won't Copy because they already exist in dst:

$($Src_Files_AlreadyInDst | select -first 100 | FT -property BaseName, FullName -Auto | Out-String)" -ForegroundColor red

# Output some of the Files we will Copy:
Write-host "
 Output some of the Files we will Copy:

$($Src_Files_MissingFromDst | select -first 100 | FT -property BaseName, FullName -Auto | Out-String)" -ForegroundColor yellow

$Count=0
# Loop Files and Copy them to Destination:
$Src_Files_MissingFromDst | %{
  $Count+=1
  copy-item $($_.Fullname) -destination $destination   # Copies files
  Add-Content -Path $log -Value "$(Get-Date -UFormat '%Y-%m-%d %H:%M:%S'): Source File # $($Count): [$($_.FullName)] Destination: $Destination"
  # Update the copy progress every 10 files
  IF ( ! [bool]( $Count % 10 ) -or $Count -eq $($Src_Files_MissingFromDst.count)  ) {
    Write-Progress -Activity "======== Copying to $Destination" -Status "## $([math]::round( $(($Count/$($Src_Files_MissingFromDst.count))*100), 1))% Complete!" -PercentComplete $([math]::round( $(($Count/$($Src_Files_MissingFromDst.count))*100), 1))
    write-host "File # $($Count): [ $($_.FullName) ] has been copied to $Destination" -ForegroundColor Green
  }

}

Now you can write your file copy/move based on the collection of matched files, and it would make sense to use a parallel process to speed that up.

Filtering the collected files once is generally faster than looping over the ID list and re-filtering the whole collection for each ID. Likewise, using the cmdlet's own -Filter parameter is almost always a better path than filtering the results afterwards, because the filtering then happens at a lower level while the data is being collected.
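
One further option, not part of the script above: -in against an array still scans the ID list for every file, so with about 5,000 IDs and hundreds of thousands of files a hash-based lookup may help. A minimal sketch, assuming the IDs are plain strings; Get-SiebelId and the sample values are hypothetical and only stand in for $SiebelIDImport.siebelId and the Get-ChildItem output:

```powershell
# Hypothetical helper: strip the known extensions to recover the ID from a file name.
function Get-SiebelId {
    param([string]$Name)
    $Name -replace '\.pdf(\.metadata\.properties\.xml)?$', ''
}

# Stand-in for $SiebelIDImport.siebelId (values taken from the question's sample CSV).
$ids = '32444167', '27721071', '27721072'

# A HashSet offers hash-based membership tests instead of a linear scan per file.
$idSet = New-Object 'System.Collections.Generic.HashSet[string]'
foreach ($id in $ids) { [void]$idSet.Add($id) }

# Stand-in for the file names returned by Get-ChildItem.
$names = '32444167.pdf', '32444167.pdf.metadata.properties.xml', '99999999.pdf'

# Keep only the names whose extracted ID is in the set.
$matched = @($names | Where-Object { $idSet.Contains((Get-SiebelId $_)) })
$matched
```

In the real script the same Where-Object test would presumably be applied to $_.Name of each file object rather than to a list of strings.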

Upvotes: 1

user6811411

Reputation:

As the siebelID seems to have 8 digits, you could use that to select files.

I'm unsure what's more efficient:

  • crawling the tree twice (for each extension) or
  • only once using a Where-Object and a Regular Expression which in one go extracts the number and checks for presence in $Filelist
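
One way to settle this could be to time both variants with Measure-Command on a representative subfolder. The sketch below is only an illustration: it builds a tiny temporary tree so that it runs as-is, and $source would need to point at a real folder for the numbers to mean anything. Disk caching skews the first run, so each variant should probably be run more than once.

```powershell
# Assumption: replace $source with a representative folder on the server.
# A small temporary tree is created here only so the sketch runs as-is.
$source = Join-Path ([System.IO.Path]::GetTempPath()) ('crawl-test-' + [guid]::NewGuid())
$sub = Join-Path $source 'sub'
New-Item -ItemType Directory -Path $sub -Force | Out-Null
'32444167.pdf', '32444167.pdf.metadata.properties.xml', 'notes.txt' |
    ForEach-Object { New-Item -ItemType File -Path (Join-Path $sub $_) | Out-Null }

# Variant 1: crawl the tree twice, once per extension.
$twice = Measure-Command {
    Get-ChildItem -Path $source -Recurse -File -Filter '*.pdf'
    Get-ChildItem -Path $source -Recurse -File -Filter '*.xml'
}

# Variant 2: crawl once with a broader filter, then narrow with a regular expression.
$once = Measure-Command {
    Get-ChildItem -Path $source -Recurse -File -Filter '*.pdf*' |
        Where-Object { $_.Name -match '\.pdf(\.metadata\.properties\.xml)?$' }
}

'Twice: {0:N3} s, once: {1:N3} s' -f $twice.TotalSeconds, $once.TotalSeconds

# Remove the temporary tree again.
Remove-Item -Path $source -Recurse -Force
```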

Console output should be reduced to the absolute minimum to speed up processing.

The following script also removes the redundancy in creating $LogValue:

## Q:\Test\2019\08\26\SO_57658091.ps1
$source = 'Q:\Test\2019' # 'C:\Temp\CSVtoXML'    # 
$target = 'A:\Test\2019' # 'C:\Temp\FindFiles\'  # 
$log = Join-Path $target  "FileCopyLog.txt"

$RE = '^(?<siebelID>\d{8})\.pdf(\.metadata\.properties\.xml)?$'
$FileList = Import-Csv "C:\Temp\FindFiles\test.csv" -Header siebelId,companyCode,receivedDate

Get-ChildItem -path $source -Recurse -File -Filter '*.pdf*' |
  Where-Object {($_.Name -match $RE ) -and
                ($Matches.siebelID -in $FileList.siebelID)} | 
ForEach-Object{
    if(!(Test-Path (Join-Path $target $_.Name))) {
        Copy-Item $_.FullName -Destination $target   # Copies files
        $Copied = 'copied to {0}' -f $target
    } else {
        $Copied = 'present in destination'
    }
    $LogValue = '{0}: File: [{1}] {2}' -f (Get-Date -UFormat "%Y-%m-%d %H:%M:%S"),$_.Name,$Copied
    # $LogValue  # optionally output, but that slows down.
    Add-Content -Path $log -Value $LogValue 
}

write-host 'Script has completed' -ForegroundColor green

A slightly adapted version, run against my test folder of stored SO scripts (which also happen to have an 8-digit number in their names), yields this FileCopyLog.txt:

2019-08-26 17:46:03: File: [SO_55464728.ps1] copied to A:\Test\2019
2019-08-26 17:46:03: File: [SO_55569099.ps1] copied to A:\Test\2019
2019-08-26 17:46:03: File: [SO_55575835.cmd] copied to A:\Test\2019
2019-08-26 17:46:03: File: [SO_55575543.ps1] copied to A:\Test\2019
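
The script above calls Add-Content once per file. With many thousands of matches it may be cheaper to collect the log lines in memory and write them in one go. A sketch under that assumption; the log path and file names here are hypothetical stand-ins:

```powershell
# Hypothetical log path in the temp folder, unique per run.
$log = Join-Path ([System.IO.Path]::GetTempPath()) ('FileCopyLog-' + [guid]::NewGuid() + '.txt')
$target = 'A:\Test\2019'   # only used in the message text here

# Collect log lines in memory instead of writing to disk once per file.
$lines = New-Object 'System.Collections.Generic.List[string]'
foreach ($name in '32444167.pdf', '27721071.pdf') {   # stand-ins for matched files
    $stamp = Get-Date -UFormat '%Y-%m-%d %H:%M:%S'
    $lines.Add(('{0}: File: [{1}] copied to {2}' -f $stamp, $name, $target))
}

# One write at the end.
Add-Content -Path $log -Value $lines
```

The trade-off is that buffered lines are lost if the script is interrupted before the final write, so flushing every few thousand entries might be a reasonable middle ground.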

Upvotes: 0

js2010

Reputation: 27418

Try:

$(Get-ChildItem -path $source -Recurse -File -Filter *.xml
  Get-ChildItem -path $source -Recurse -File -Filter *.pdf)

Upvotes: 0
