screenslaver

Reputation: 613

Optimise looping through file contents

I have two files, file1 and file2, and I need to check whether every entry in file1 is present in file2. The contents of file1 look like this:

ABC1234
BFD7890

And the contents of file2 look like this:

ABC1234_20180902_XYZ
BFD7890_20110890_123

The lines are in no particular order, and splitting on a delimiter is not possible because the delimiters differ from line to line. All I need to confirm is that each string from file1 appears somewhere in file2. The same pattern will never occur twice.

Both files contain more than 20,000 lines.

This is what I currently have:

$filesfromDB   = gc file1.txt
$filesfromSFTP = gc file2.txt
foreach ($f in $filesfromDB) {
    $FilePresentStatus = $filesfromSFTP | Select-String -Quiet -Pattern $f
    if ($FilePresentStatus -ne $true) {
        $MissingFiles += $f
    }
}

This works fine when the files are small, but in production it is really slow: the loop takes around 4 hours to complete. How do I optimize this part of the script?

Upvotes: 0

Views: 96

Answers (3)

Nas

Reputation: 1263

Using a hashtable, the code below takes around 15 minutes on my laptop with two files of 20,000 lines each.

$filesfromDB   = gc file1.txt
$filesfromSFTP = gc file2.txt
$MissingFiles  = @()
$hashtbl       = @{}

foreach ($f in $filesfromDB) {
    $hashtbl."Line$($f.ReadCount)"=[regex]$f
}

foreach ($key in $hashtbl.Keys) {
    $FilePresentStatus = $hashtbl[$key].Matches($filesfromSFTP)
    if ($FilePresentStatus.Count -eq 0) {
        $MissingFiles += $hashtbl[$key].ToString()
    }
}

Upvotes: 0

Paweł Dyl

Reputation: 9143

20,000 lines is not that many, but in the worst case you have to do 20,000 × 20,000 = 400,000,000 comparisons. The key is to stop each search as soon as a match is found. You can also use the much faster [string].Contains method instead of Select-String, which is regular-expression based unless the -SimpleMatch switch is used.

See the following demo:

$db =   1000000..1020000
$sftp = (1001000..1021000 | % { "$($_)_SomeNotImportantTextHere" }) -join "`r`n"

$missingFiles = $db | where { !$sftp.Contains($_) }

Each collection contains about 20,000 items: 19,000 are common and 1,000 exist only in $db. It runs in a couple of seconds.

To read $filesfromSFTP as one big string, use:

gc file2.txt -Raw

To convert the result to a single string, use $missingFiles -join 'separator'.
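
Applied to the two files from the question, the same idea might look like the sketch below. The function name Get-MissingEntry and its parameter names are hypothetical, made up for illustration. Two differences from Select-String are worth keeping in mind: Contains compares literally rather than as a regular expression, and it is case-sensitive, whereas Select-String ignores case by default.

```powershell
# Hypothetical helper (not part of the original answer): returns every line of
# $NeedlePath that does not occur anywhere in the text of $HaystackPath.
function Get-MissingEntry {
    param(
        [Parameter(Mandatory)] [string] $NeedlePath,
        [Parameter(Mandatory)] [string] $HaystackPath
    )

    # -Raw reads the whole file as one string instead of an array of lines.
    # The [string] cast turns the $null returned for an empty file into ''.
    $haystack = [string](Get-Content -Path $HaystackPath -Raw)

    # Skip blank lines, then keep only the lines that are not found.
    # Where-Object emits the results directly, so nothing is appended with +=.
    Get-Content -Path $NeedlePath |
        Where-Object { $_ -and -not $haystack.Contains($_) }
}

# Usage with the file names from the question:
# $MissingFiles = Get-MissingEntry -NeedlePath file1.txt -HaystackPath file2.txt
```

Because the whole of file2 is a single string here, an entry from file1 would also count as found if it happened to span a line break in file2; with the identifiers shown in the question that should not occur.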

Upvotes: 1

Armin Lizde

Reputation: 1

I think your problem lies in the += operator. Try this: https://powershell.org/2013/09/16/powershell-performance-the-operator-and-when-to-avoid-it/
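
As a rough sketch of what avoiding += could look like, the loop below collects results in a generic list, which grows in place, whereas += on an array builds a new array on every append. The sample values stand in for the two files and the third entry is invented for illustration. Adding -SimpleMatch is an extra assumption on my part, so that each entry is matched as literal text.

```powershell
# Sample data standing in for the contents of file1.txt and file2.txt.
$filesfromDB   = 'ABC1234', 'BFD7890', 'QRS5555'
$filesfromSFTP = 'ABC1234_20180902_XYZ', 'BFD7890_20110890_123'

# A generic list supports Add() without copying the existing items.
$MissingFiles = New-Object 'System.Collections.Generic.List[string]'

foreach ($f in $filesfromDB) {
    # -Quiet returns $true on the first match and nothing when there is none.
    # -SimpleMatch treats $f as literal text rather than a regular expression.
    $found = $filesfromSFTP | Select-String -Quiet -SimpleMatch -Pattern $f
    if (-not $found) {
        $MissingFiles.Add($f)
    }
}

$MissingFiles
```

The append only runs for entries that are missing, so this probably removes just part of the cost; the repeated search through file2 remains.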

Upvotes: 0
