Reputation: 613
I have two files, file1 and file2. I need to check whether all the contents of file1 are present in file2.
The contents of file1 look like this:
ABC1234
BFD7890
And the contents of file2 look like this:
ABC1234_20180902_XYZ
BFD7890_20110890_123
The lines are not in any specific order, and it is not possible to split on a delimiter, because the delimiters differ from line to line. The only thing I need to confirm is that each string from file1 is present somewhere in file2. The same pattern will never occur twice.
Both files contain more than 20000 lines.
This is what I currently have:
$filesfromDB = gc file1.txt
$filesfromSFTP = gc file2.txt
foreach ($f in $filesfromDB) {
    $FilePresentStatus = $filesfromSFTP | Select-String -Quiet -Pattern $f
    if ($FilePresentStatus -ne $true) {
        $MissingFiles += $f
    }
}
This works fine if the files are small, but when I run this in prod, it is really slow. It takes around 4 hours to complete this loop. How do I optimize this piece of script?
Upvotes: 0
Views: 96
Reputation: 1263
Working with a hashtable, the code below takes around 15 minutes on my laptop with two files of 20000 lines each.
$filesfromDB = gc file1.txt
$filesfromSFTP = gc file2.txt
$MissingFiles = @()
$hashtbl = @{}
foreach ($f in $filesfromDB) {
    $hashtbl."Line$($f.ReadCount)"=[regex]$f
}
foreach ($key in $hashtbl.Keys) {
    $FilePresentStatus = $hashtbl[$key].Matches($filesfromSFTP)
    if ($FilePresentStatus.Count -eq 0) {
        $MissingFiles += $hashtbl[$key].ToString()
    }
}
Upvotes: 0
Reputation: 9143
20000 is not that much, but in the worst case you have to do 20000x20000=400000000 comparisons. The key is to stop each search as soon as a match is found. You can also use the much faster [string].Contains method instead of the regular-expression-based Select-String (unless the -SimpleMatch switch is used).
See the following demo:
$db = 1000000..1020000
$sftp = (1001000..1021000 | % { "$($_)_SomeNotImportantTextHere" }) -join "`r`n"
$missingFiles = $db | where { !$sftp.Contains($_) }
Each collection contains about 20000 items: about 19000 are common to both, and 1000 exist only in $db. It runs in a couple of seconds.
To read $filesfromSFTP as one big string, use:
gc file2.txt -Raw
To convert the result to a single string, use $missingFiles -join 'separator'.
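Putting the pieces of this answer together for the two files in the question might look like the sketch below. The function name Get-MissingEntry and its parameter names are made up for illustration, and it assumes a plain case-sensitive substring match is what is wanted, since that is what [string].Contains does.

```powershell
# Hypothetical helper: returns every entry of $Needles that does not
# occur anywhere in $Haystack (one big string, e.g. a whole file).
function Get-MissingEntry {
    param(
        [string[]]$Needles,
        [string]$Haystack
    )
    foreach ($needle in $Needles) {
        # Skip blank lines; String.Contains is an ordinal, case-sensitive search.
        if ($needle -and -not $Haystack.Contains($needle)) {
            $needle
        }
    }
}
```

With the files from the question, it could be called as `Get-MissingEntry -Needles (gc file1.txt) -Haystack (gc file2.txt -Raw)`, and the result joined with `-join` if a single string is needed.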
Upvotes: 1
Reputation: 1
I think your problem lies in the += operator. Try this: https://powershell.org/2013/09/16/powershell-performance-the-operator-and-when-to-avoid-it/
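A minimal sketch of what avoiding += could look like, assuming PowerShell 5.0 or later for the ::new() syntax. The sample values below stand in for the real file contents, and 'ZZZ0000' is a made-up entry that is missing on purpose. How much of the four hours this would actually save has not been measured here.

```powershell
# Sample data only; in the real script these would come from the two files.
$sftp = "ABC1234_20180902_XYZ`r`nBFD7890_20110890_123"
$db   = 'ABC1234', 'BFD7890', 'ZZZ0000'

# A generic list grows in place, whereas += on an array
# creates a new array and copies every element on each append.
$missingFiles = [System.Collections.Generic.List[string]]::new()
foreach ($f in $db) {
    if (-not $sftp.Contains($f)) {
        $missingFiles.Add($f)
    }
}
```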
Upvotes: 0