Reputation: 21
I'm using PowerShell 5.0 and I have a .CSV file with a list of siebelId values (approx. 5000). I want to search through each folder and subfolder on a server for any file that contains one of those list items (siebelId) in the file name, e.g. filename: 32444167.pdf or 32444167.pdf.metadata.properties.xml
Example CSV file:
32444167,ACME,4/15/2013
27721071,ACME,4/15/2013
27721072,ACME,4/15/2013
I am filtering on *.PDF and *.XML. Then I want to copy the found files to a destination folder on the same server. The problem is, I have hundreds of thousands of files in the folder and subfolders. The code I wrote seems to take a long time to run, up to several days. I am not an expert and believe I have not written the most efficient PowerShell script. Any help would be appreciated.
Basically, the code works, but it is extremely slow when processing a folder that has hundreds of thousands of files. It seems inefficient to call Get-ChildItem each time I'm getting a new item from the list.
$PDFExtension = '.pdf'
$XMLExtension = '.pdf.metadata.properties.xml'
$source = 'C:\Temp\CSVtoXML'
$destination = 'C:\Temp\FindFiles\' #'
$strGetDate = get-date -UFormat "%Y-%m-%d %H:%M:%S"
$log = $destination + "FileCopyLog.txt"
$FileList = import-csv "C:\Temp\FindFiles\test.csv" -Delimiter "," -Header 'siebelId', 'companyCode', 'receivedDate'
$GetFiles = @(Get-ChildItem -path $source -Recurse -File -include *.xml, *.pdf ) | select -First 100000
ForEach ($item in $FileList){
    $siebelId = $($item.siebelId) + $PDFExtension
    $XMLFile = $($item.siebelId) + $XMLExtension
    $FilterFiles = @($GetFiles) | Where-Object {$_.name -eq $siebelId -or $_.name -eq $XMLFile} #| Out-File $destination"FileCopyLog.csv"
    #write-host "Filtered Files: " $FilterFiles
    ForEach ($file in $FilterFiles){
        $fileBase = $file.BaseName
        $fileExt = $file.Extension
        write-host "file: " $fileBase$fileExt
        If (-not ([string]::IsNullOrEmpty($file))) {
            if(!(Test-Path -Path $Destination$fileBase$fileExt)) {
                copy-item $file -destination $destination # Copies files
                write-host "File: [" $file "] has Been Copied! to " $Destination `n`r -ForegroundColor yellow
                $strGetDate = get-date -UFormat "%Y-%m-%d %H:%M:%S"
                $LogValue = $strGetDate + ': ' + "Source: [" + $file + "] Destination: " + $Destination
                Add-Content -Path $log -Value $LogValue
            } else
            {
                write-host "File: [" $file "] already exists in destination folder" `n`r -ForegroundColor yellow
                $strGetDate = get-date -UFormat "%Y-%m-%d %H:%M:%S"
                $LogValue = $strGetDate + ': ' + "File: [" + $file + "] already exists in destination folder! "
                Add-Content -Path $log -Value $LogValue
            }
        }else{
            write-host "No File was copied!" `n`r -ForegroundColor red
        }
    }
}
write-host 'Script has completed' -ForegroundColor green
The result I'm looking for is to have this process finish within a couple of hours rather than several days.
Upvotes: 1
Views: 1013
Reputation: 3264
Amended to use ".pdf.metadata.properties.xml" instead of XML, and to match those files by stripping '.pdf.metadata.properties' from the BaseName of the files we found.
edit
Also put in more of your script to reduce the time taken in the copy process, by generating a list of destination files and then filtering the files we'll copy by fi
$Exts = @('.pdf', '.pdf.metadata.properties.xml')
$source = 'C:\Temp\CSVtoXML'
$destination = 'C:\Temp\FindFiles\' #'
$strGetDate = get-date -UFormat "%Y-%m-%d %H:%M:%S"
$log = "$($destination)FileCopyLog.txt"
$SiebelIDFile = "$($destination)test.csv"
$SiebelIDImport = import-csv $SiebelIDFile -Delimiter "," -Header 'siebelId', 'companyCode', 'receivedDate'
# Enumerate the source once per extension; -Filter needs a wildcard pattern such as *.pdf
$SRC_Matched_Exts = $( $Exts | % { Get-ChildItem -path $source -Recurse -File -Filter "*$_" } )
# Presto we can filter the list using the Siebel IDs
$Results = $SRC_Matched_Exts | ? { ($_.BaseName -replace '\.pdf\.metadata\.properties$', '') -in $SiebelIDImport.siebelId }
# Confirm results by outputting first 100
$Results | select -first 100 | FT -property BaseName, FullName -Auto
# Get Destination Files to compare:
$Dst_Matched_Exts = $( $Exts | % { Get-ChildItem -path $destination -Recurse -File -Filter "*$_" } )
# Filter to only the Source files not in the destination:
$Src_Files_MissingFromDst = $Results | ? { $_.BaseName -notin $Dst_Matched_Exts.BaseName }
$Src_Files_AlreadyInDst = $Results | ? { $_.BaseName -notin $Src_Files_MissingFromDst.BaseName }
# Output some of the Files we won't Copy because they already exist in dst:
Write-host "
Output some of the Files we won't Copy because they already exist in dst:
$($Src_Files_AlreadyInDst | select -first 100 | FT -property BaseName, FullName -Auto | Out-String)" -ForegroundColor red
# Output some of the Files we will Copy:
Write-host "
Output some of the Files we will Copy:
$($Src_Files_MissingFromDst | select -first 100 | FT -property BaseName, FullName -Auto | Out-String)" -ForegroundColor yellow
$Total = @($Src_Files_MissingFromDst).Count
$Count = 0
# Loop Files and Copy them to Destination:
$Src_Files_MissingFromDst | % {
    $Count += 1
    copy-item $_.FullName -destination $destination # Copies files
    # ${Count} is braced so the following colon is not parsed as part of the variable name
    Add-Content -Path $log -Value "$(Get-Date -UFormat '%Y-%m-%d %H:%M:%S'): Source File # ${Count}: [$($_.FullName)] Destination: $destination"
    # Update the copy progress every 10 files
    if ( ($Count % 10) -eq 0 -or $Count -eq $Total ) {
        $Percent = [math]::Round(($Count / $Total) * 100, 1)
        Write-Progress -Activity "======== Copying to $destination" -Status "## $Percent% Complete!" -PercentComplete $Percent
        write-host "File # ${Count}: [ $($_.FullName) ] has Been Copied to $destination " -ForegroundColor Green
    }
}
Now you can write your file copy/move based on the collection of matched files, and it would make sense to use a parallel process to speed that up.
Explicit loops are generally slower than filtering in the pipeline. Using the in-line -Filter parameter on the command is also almost always a better path than filtering the results afterwards, because the filtering then happens at the lower level while the data is being collected.
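One further option for the ID comparison itself, as a minimal sketch rather than part of the script above: -in against an array scans that array for every file, so with roughly 5000 IDs and hundreds of thousands of files a hash set lookup may help. The sample IDs and file names below are hypothetical stand-ins for the imported CSV and the Get-ChildItem results, and the sketch assumes the ID is everything before the first dot in the file name.

```powershell
# Hypothetical sample data standing in for $SiebelIDImport.siebelId
$siebelIds = '32444167', '27721071', '27721072'

# Build the set once; Contains() is a constant-time lookup on average
$idSet = [System.Collections.Generic.HashSet[string]]::new([string[]]$siebelIds)

# Hypothetical file names standing in for the Name property of Get-ChildItem results
$names = '32444167.pdf', '32444167.pdf.metadata.properties.xml', '99999999.pdf', 'notes.xml'

$matched = foreach ($name in $names) {
    # Assumption: the ID is the part of the name before the first dot
    $id = $name.Split('.')[0]
    if ($idSet.Contains($id)) { $name }
}
$matched
```

In the real script, $name would come from $_.Name (or $_.BaseName) of each file object, and the matching file objects would be collected instead of the bare names.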
Upvotes: 1
Reputation:
As the siebelID seems to have 8 digits, you could use that to select files.
I'm unsure what's more efficient:
$Filelist
Console output should be reduced to the absolute minimum to speed up processing.
The following script also removes redundancy in creating $LogValue
## Q:\Test\2019\08\26\SO_57658091.ps1
$source = 'Q:\Test\2019' # 'C:\Temp\CSVtoXML' #
$target = 'A:\Test\2019' # 'C:\Temp\FindFiles\' #
$log = Join-Path $target "FileCopyLog.txt"
$RE = '^(?<siebelID>\d{8})\.pdf(\.metadata\.properties\.xml)?'
$FileList = Import-Csv "C:\Temp\FindFiles\test.csv" -Header siebelId,companyCode,receivedDate
Get-ChildItem -path $source -Recurse -File -Filter '*.pdf*' |
    Where-Object {($_.Name -match $RE ) -and
                  ($Matches.siebelID -in $FileList.siebelID)} |
    ForEach-Object{
        if(!(Test-Path (Join-Path $target $_.Name))) {
            Copy-Item $_.FullName -Destination $target # Copies files
            $Copied = 'copied to {0}' -f $target
        } else {
            $Copied = 'present in destination'
        }
        $LogValue = '{0}: File: [{1}] {2}' -f (Get-Date -UFormat "%Y-%m-%d %H:%M:%S"),$_.Name,$Copied
        # $LogValue # optionally output, but that slows down.
        Add-Content -Path $log -Value $LogValue
    }
write-host 'Script has completed' -ForegroundColor green
A slightly adapted version to search through my test folder with stored SO scripts which happen to also have an 8 digit number yields this FileCopyLog.txt
2019-08-26 17:46:03: File: [SO_55464728.ps1] copied to A:\Test\2019
2019-08-26 17:46:03: File: [SO_55569099.ps1] copied to A:\Test\2019
2019-08-26 17:46:03: File: [SO_55575835.cmd] copied to A:\Test\2019
2019-08-26 17:46:03: File: [SO_55575543.ps1] copied to A:\Test\2019
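To see how the 8-digit pattern selects names, here is a small sketch run against plain strings instead of real files. The file names and the $wanted list are made up for illustration, and the pattern is a variation of the $RE above with a trailing $ anchor added, which is an assumption on my part so that names such as 27721071.pdf.bak are rejected.

```powershell
# Variation of $RE from the script, with an added end-of-string anchor (assumption)
$RE = '^(?<siebelID>\d{8})\.pdf(\.metadata\.properties\.xml)?$'

# Hypothetical IDs standing in for $FileList.siebelID
$wanted = '32444167', '27721071'

# Hypothetical file names
$names = '32444167.pdf',
         '32444167.pdf.metadata.properties.xml',
         '27721071.pdf.bak',
         '1234.pdf',
         '55555555.pdf'

# -match fills the automatic $Matches variable, so the named group can be checked right away
$selected = $names | Where-Object { ($_ -match $RE) -and ($Matches.siebelID -in $wanted) }
$selected
```

Only the two 32444167 names pass: 27721071.pdf.bak fails the anchored pattern, 1234.pdf does not have 8 digits, and 55555555.pdf matches the pattern but is not in the wanted list.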
Upvotes: 0
Reputation: 27418
Try:
$(Get-ChildItem -path $source -Recurse -File -Filter *.xml
Get-ChildItem -path $source -Recurse -File -Filter *.pdf)
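The idea behind this, as far as I understand it, is that -Filter is handed to the file system provider, while -Include is applied by PowerShell after the items have been retrieved, so two filtered passes can still be cheaper than one -Include pass. A throwaway sketch to try it out; the temporary folder and the file names are hypothetical:

```powershell
# Create a scratch folder with a subfolder and three small files
$root = Join-Path ([System.IO.Path]::GetTempPath()) ([System.Guid]::NewGuid().ToString())
$sub  = Join-Path $root 'sub'
New-Item -ItemType Directory -Path $sub -Force | Out-Null
'a' | Set-Content (Join-Path $root '32444167.pdf')
'b' | Set-Content (Join-Path $sub  '32444167.pdf.metadata.properties.xml')
'c' | Set-Content (Join-Path $sub  'readme.txt')

# Two provider-side filtered enumerations combined into one collection
$found = $(Get-ChildItem -Path $root -Recurse -File -Filter *.xml
           Get-ChildItem -Path $root -Recurse -File -Filter *.pdf)
$foundNames = $found.Name | Sort-Object
$foundNames

# Clean up the scratch folder
Remove-Item $root -Recurse -Force
```

Whether this is actually faster on your server is worth measuring, for example with Measure-Command on a representative subfolder.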
Upvotes: 0