Reputation: 9601
I've created a script which analyzes the debug logs from Windows DNS Server.
It does the following:
1. Opens the debug log using the [System.IO.File] class
2. Reads each line

Steps 1 and 2 take the longest. In fact, they take a seemingly endless amount of time, because the file is growing as it is being read.
Due to the size of the debug log (80,000 KB), it takes a very long time.
I believe that my code is fine for smaller text files, but it fails to deal with much larger files.
Here is my code: https://github.com/cetanu/msDnsStats/blob/master/msdnsStats.ps1
This is what the debug log looks like (including the blank lines):
Multiply this by about 100,000,000 and you have my debug log.
21/03/2014 2:20:03 PM 0D0C PACKET 0000000005FCB280 UDP Rcv 202.90.34.177 3709 Q [1001 D NOERROR] A (2)up(13)massrelevance(3)com(0)
21/03/2014 2:20:03 PM 0D0C PACKET 00000000042EB8B0 UDP Rcv 67.215.83.19 097f Q [0000 NOERROR] CNAME (15)manchesterunity(3)org(2)au(0)
21/03/2014 2:20:03 PM 0D0C PACKET 0000000003131170 UDP Rcv 62.36.4.166 a504 Q [0001 D NOERROR] A (3)ekt(4)user(7)net0319(3)com(0)
21/03/2014 2:20:03 PM 0D0C PACKET 00000000089F1FD0 UDP Rcv 80.10.201.71 3e08 Q [1000 NOERROR] A (4)dns1(5)offis(3)com(2)au(0)
I need ways or ideas on how to open and read each line of a file more quickly than what I am doing now.
I am open to suggestions of using a different language.
Upvotes: 0
Views: 116
Reputation: 68273
I would trade this:
$dnslog = [System.IO.File]::Open("c:\dns.log","Open","Read","ReadWrite")
$dnslog_content = New-Object System.IO.StreamReader($dnslog)
For ($i=0;$i -lt $dnslog.length; $i++)
{
    $line = $dnslog_content.readline()
    if ($line -eq $null) { continue }

    # REGEX MATCH EACH LINE OF LOGFILE
    $pattern = $line | select-string -pattern $regex

    # IGNORE EMPTY MATCH
    if ($pattern -eq $null) {
        continue
    }
for this:
Get-Content 'c:\dns.log' -ReadCount 1000 |
  ForEach-Object {
    foreach ($line in $_)
    {
        if ($line -match $regex)
        {
            #Process matches
        }
    }
  }
That will reduce the number of file read operations by a factor of 1000.
Trading out the Select-String operation will require re-factoring the rest of the code to work with $matches[n] instead of $pattern.matches[0].groups[$n].value, but it is much faster. Select-String returns MatchInfo objects, which carry a lot of additional information about the match (line number, filename, etc.); that's great if you need it, but if all you need is the strings from the captures, it's wasted effort.
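For example, a minimal sketch of that refactor (the capture-group index 1 and the $regex variable are carried over from the snippets here, not from the full script):

# Before: Select-String returns a MatchInfo object
$pattern = $line | Select-String -Pattern $regex
$date = $pattern.Matches[0].Groups[1].Value

# After: -match populates the automatic $matches hashtable
if ($line -match $regex)
{
    $date = $matches[1]
}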
You're creating an object ($log), and then accumulating values into array properties:
$log.date += @($pattern.matches[0].groups[$n].value); $n++
That array addition is going to kill your performance. Also, hash table operations are faster than object property updates.
I'd create $log as a hash table first, and the key values as array lists:
$log = @{}
$log.date = New-Object collections.arraylist
Then inside your loop:
$log.date.Add($matches[1]) > $null
Then create your object from $log after you've populated all of the array lists.
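Putting it together, a rough sketch (the date field and capture-group index 1 are placeholders taken from the snippets above; any additional fields would follow the same pattern):

$log = @{}
$log.date = New-Object System.Collections.ArrayList

Get-Content 'c:\dns.log' -ReadCount 1000 |
  ForEach-Object {
    foreach ($line in $_)
    {
        if ($line -match $regex)
        {
            # ArrayList.Add() returns the new index; discard it
            $log.date.Add($matches[1]) > $null
        }
    }
  }

# Build the output object once, after all the array lists are populated
$result = New-Object PSObject -Property $log

ArrayList.Add is amortized O(1), whereas += on a plain array allocates a new array and copies every existing element on each append, which is what makes it so slow over ~100 million lines.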
Upvotes: 1
Reputation: 24071
As a general piece of advice, use Measure-Command to find out which script blocks take the longest time.
That being said, the sleep call seems a bit odd. If I'm not mistaken, you sleep 20 ms after each row:
sleep -milliseconds 20
Multiply 20 ms by the size of the log, roughly 100 million rows, and the sleeps alone add up to about 2,000,000 seconds, which is more than three weeks.
Try sleeping only after some decent batch size instead. See whether every 10,000 rows works well, like so:
if ($i % 10000 -eq 0) {
    write-host -nonewline "."
    start-sleep -milliseconds 20
}
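For context, a minimal sketch of where that check could sit, using a plain StreamReader while/ReadLine loop rather than the original for loop (variable names borrowed from the script in the question):

$i = 0
while ($null -ne ($line = $dnslog_content.ReadLine()))
{
    $i++
    if ($i % 10000 -eq 0) {
        write-host -nonewline "."
        start-sleep -milliseconds 20
    }
    # ... process $line ...
}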
Upvotes: 0