Predrag Vasić

Reputation: 351

Improving performance on PowerShell filtering statement

I have a script that goes through an HTTP access log, selects the lines that match a regex pattern, and copies them into another file:

param($workingdate=(get-date).ToString("yyMMdd"))
Get-Content "access-$workingdate.log" | 
Select-string -pattern $pattern | 
Add-Content "D:\webStatistics\log\filtered-$workingdate.log"

My logs can be quite large (up to 2 GB), and the script takes up to 15 minutes to run. Is there anything I can do to improve the performance of the statement above?

Thank you for your thoughts!

Upvotes: 2

Views: 1595

Answers (3)

campbell.rw

Reputation: 1386

You could also see whether using streams speeds it up. Something like this might help, although I couldn't test it because, as mentioned in another answer, I'm not sure what pattern you are using.

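param($workingdate=(get-date).ToString("yyMMdd"))

# .NET resolves relative paths against the process working directory,
# not the current PowerShell location, so build a full path first.
$file = New-Object System.IO.StreamReader -Arg (Join-Path $PWD.Path "access-$workingdate.log")
$stream = New-Object System.IO.StreamWriter -Arg "D:\webStatistics\log\filtered-$workingdate.log"

try {
    # Compare against $null so that an empty line does not end the loop early.
    while ($null -ne ($line = $file.ReadLine())) {
        if ($line -match $pattern) {
            $stream.WriteLine($line)
        }
    }
}
finally {
    $file.Close()
    $stream.Close()
}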
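A variation on the same streaming idea, as a minimal sketch only: it builds the regex once with the Compiled option and enumerates the file lazily with [System.IO.File]::ReadLines. The sample file names, the sample log lines, and the pattern are all assumptions made up for illustration, since the question does not show the real pattern. Note that a [regex] object is case-sensitive by default, whereas -match is not.

```powershell
# Hypothetical sample data so the sketch runs anywhere; swap in the real log paths.
$tempDir = [System.IO.Path]::GetTempPath()
$inPath  = Join-Path $tempDir 'access-sample.log'
$outPath = Join-Path $tempDir 'filtered-sample.log'
Set-Content -Path $inPath -Value @(
    '10.0.0.1 - - "GET /index.html HTTP/1.1" 200 512',
    '',
    '10.0.0.2 - - "GET /path/js/app.js HTTP/1.1" 200 2048'
)

# Assumed pattern (the real one is not shown in the question).
$options = [System.Text.RegularExpressions.RegexOptions]::Compiled
$regex   = New-Object System.Text.RegularExpressions.Regex -ArgumentList '\.js ', $options

$writer = New-Object System.IO.StreamWriter -ArgumentList $outPath
try {
    # ReadLines yields one line at a time instead of loading the whole file.
    foreach ($line in [System.IO.File]::ReadLines($inPath)) {
        if ($regex.IsMatch($line)) {
            $writer.WriteLine($line)
        }
    }
}
finally {
    $writer.Close()
}

# Read the result back to check what was written.
$filtered = @(Get-Content -Path $outPath)
```

Whether this is actually faster depends on the pattern and the data, so it is worth timing both versions with Measure-Command on a real log.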

Upvotes: 0

Zan Lynx

Reputation: 54325

You don't show your patterns, but I suspect they are a large part of the problem.

You will want to look for a new question here (I am sure it has been asked) or elsewhere for detailed advice on building fast regular expression patterns.

But I find the best advice is to anchor your patterns and to avoid runs of unknown length that match any character (such as .*).

So instead of a pattern like path/.*/.*\.js, use path/.*/.*\.js$, with a $ on the end to anchor it to the end of the string. That way the regex engine can tell immediately that index.html is not a match. Otherwise it has to do some rather complicated scans, with path/ and .js possibly showing up anywhere in the string. This example of course assumes the file name is at the end of the log line.

Anchors work well with start-of-line patterns as well. A pattern might look like ^[^"]*"GET /myfile". That has a run of unknown length, but at least the engine knows that it doesn't have to restart the search for more quotes after finding the first one. The [^"] character class allows the regex engine to stop because the pattern can't match after the first quote.
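To make the effect of the anchors concrete, here is a small sketch. The log lines are invented for illustration and assume, as above, that the file name is the last thing on the line. It only shows which lines each pattern accepts; it does not measure speed.

```powershell
# Hypothetical log lines; the requested path is assumed to be the last field.
$lines = @(
    '10.0.0.1 GET /path/site/js/app.js',
    '10.0.0.2 GET /index.html',
    '10.0.0.3 GET /path/site/js/app.js.map'
)

$unanchored = 'path/.*/.*\.js'    # may match anywhere in the line
$anchored   = 'path/.*/.*\.js$'   # must end exactly at the end of the line

# -match applied to an array returns the elements that match.
$unanchoredHits = @($lines -match $unanchored)
$anchoredHits   = @($lines -match $anchored)

# Start-of-line anchor: [^"]* cannot cross the first double quote,
# so only a request in the first quoted field is accepted.
$quoted      = '^[^"]*"GET /myfile"'
$firstField  = '1.2.3.4 - - "GET /myfile" 200' -match $quoted
$secondField = '1.2.3.4 - - "POST /x" "GET /myfile"' -match $quoted
```

With these sample lines, the unanchored pattern also accepts the app.js.map line, while the anchored one keeps only the line that really ends in .js. The quoted pattern accepts the first test string and rejects the second, where "GET /myfile" appears only after an earlier quoted field.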

Upvotes: 3

mjolinor

Reputation: 68273

See if this isn't faster than your current solution:

param($workingdate=(get-date).ToString("yyMMdd"))
Get-Content "access-$workingdate.log" -ReadCount 2000 |
    foreach {
        $_ -match $pattern |
            Add-Content "D:\webStatistics\log\filtered-$workingdate.log"
    }
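The reason this works is that -ReadCount 2000 sends the lines down the pipeline in arrays of up to 2000 strings, and -match applied to an array returns the matching elements rather than a Boolean. A tiny sketch of that behavior, using an assumed pattern and made-up lines:

```powershell
# Assumed pattern and sample batch, for illustration only.
$pattern = '\.js$'
$batch   = @('GET /a.js', 'GET /b.html', 'GET /c.js')

# On an array, -match acts as a filter and returns the matching elements.
$kept = @($batch -match $pattern)

# On a single string, -match returns $true or $false instead.
$single = 'GET /a.js' -match $pattern
```

Here $kept holds the two .js lines and $single is $true, which is why the batched version can pipe the result of -match straight into Add-Content.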

Upvotes: 4
