naugiedoggie

Reputation: 320

Searching many large text files in PowerShell

I frequently have to search server log files in a directory that may contain 50 or more files of 200+ MB each. I've written a function in PowerShell to do this searching. It finds and extracts all the values for a given query parameter. It works great on an individual large file or a collection of small files, but it totally bites in the above circumstance: a directory of large files.

The function takes a parameter, which consists of the query parameter to be searched.

In pseudo-code:

Take parameter (e.g. someParam or someParam=([^& ]+))
Create a regex (if one is not supplied)
Collect a directory list of *.log, pipe to Select-String
For each pipeline object, add the matched values to a hash as keys
Increment a match counter
Call GC
At the end of the pipelining: 
if (hash has keys) 
    enumerate the hash keys, 
    sort and append to string array
    set-content the string array to a file 
    print summary to console
    exit
else
    print summary to console
    exit

Here's a stripped-down version of the file processing.

# De-dupe by using the captured values as hash keys; $items counts the total matches.
$wtmatches = @{};
gci -Filter *.log | Select-String -Pattern $searcher |
    %{ $wtmatches[$_.Matches[0].Groups[1].Value]++; $items++; [GC]::Collect(); }
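The tail end, after the pipeline completes, is roughly this (simplified; $outFile and the exact messages are placeholders, not the literal code from my module):

if ($wtmatches.Count -gt 0) {
    # sort the unique values (the hash keys) and write them out
    $found = $wtmatches.Keys | Sort-Object
    Set-Content -Path $outFile -Value $found
    Write-Host ("{0} matched lines, {1} unique values written to {2}." -f $items, $wtmatches.Count, $outFile)
}
else {
    Write-Host ("{0} matched lines, no values captured." -f $items)
}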

I'm just using an old Perl trick of de-duplicating found items by making them the keys of a hash. Perhaps this is an error, but a typical output of the processing is going to be around 30,000 items at most; more typically, the number of found items is in the low thousands. From what I can see, the number of keys in the hash does not affect processing time; it is the size and number of the files that breaks it. I recently threw in the GC call in desperation, and it does have some positive effect, but it is marginal.

The issue is that with the large collection of large files, the processing sucks the RAM pool dry in about 60 seconds. It doesn't actually use a lot of CPU, interestingly, but there's a lot of memory churn going on. Once RAM usage has gone over 90%, I can just punch out and go watch TV. It can take hours to complete the processing and produce a file with 15,000 or 20,000 unique values.

I would like advice and/or suggestions for increasing the efficiency, even if that means using a different paradigm to accomplish the processing. I went with what I know. I use this tool on almost a daily basis.

Oh, and I'm committed to using PowerShell. ;-) This function is part of a complete module I've written for my job, so suggestions of Python, Perl, or other (otherwise useful) languages are not useful in this case.

Thanks.

mp

Update: Using latkin's ProcessFile function, I used the following wrapper for testing. His function is orders of magnitude faster than my original.

function Find-WtQuery {

<#
 .Synopsis
  Takes a parameter with a capture regex and a wildcard for files list.

 .Description
  This function is intended to be used on large collections of large files that have
  the potential to take an unacceptably long time to process using other methods. It
  requires that a regex capture group be passed in as the value to search for.

 .Parameter Target
  The parameter with capture group to find, e.g. WT.z_custom=([^ &]+).

 .Parameter Files
  The file wildcard to search, e.g. '*.log'

 .Outputs
  An object with an array of unique values and a count of total matched lines.
#>

    param(
        [Parameter(Mandatory = $true)] [string] $target,
        [Parameter(Mandatory = $false)] [string] $files
    )

    begin{
        $stime = Get-Date
    }
    process{
        $results = gci -Filter $files | ProcessFile -Pattern $target  -Group 1;
    }
    end{
        $etime = Get-Date;
        $ptime = $etime - $stime;
        Write-Host ("Processing time for {0} files was {1}:{2}:{3}." -f (gci   
    -Filter $files).Count, $ptime.Hours,$ptime.Minutes,$ptime.Seconds);
        return $results;
    }
}

The output:

clients:\test\logs\global
{powem} [4] --> Find-WtQuery -target "WT.ets=([^ &]+)" -files "*.log"
Processing time for 53 files was 0:1:35.

Thanks to all for comments and help.

Upvotes: 2

Views: 5736

Answers (2)

Keith Hill

Reputation: 202052

IMO @latkin's approach is the way to go if you want to do this within PowerShell and not use some dedicated tool. I made a few changes, though, to make the command play better with respect to accepting pipeline input. I also modified the regex to search for all matches on a particular line. Neither approach searches across multiple lines, although that scenario would be pretty easy to handle as long as the pattern only ever spanned a few lines. Here's my take on the command (put it in a file called Search-File.ps1):

[CmdletBinding(DefaultParameterSetName="Path")]
param(
    [Parameter(Mandatory=$true, Position=0)]
    [ValidateNotNullOrEmpty()]
    [string]
    $Pattern,

    [Parameter(Mandatory=$true, Position=1, ParameterSetName="Path", 
               ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true,
               HelpMessage="Path to ...")]
    [ValidateNotNullOrEmpty()]
    [string[]]
    $Path,

    [Alias("PSPath")]
    [Parameter(Mandatory=$true, Position=1, ParameterSetName="LiteralPath", 
               ValueFromPipelineByPropertyName=$true,
               HelpMessage="Path to ...")]
    [ValidateNotNullOrEmpty()]
    [string[]]
    $LiteralPath,

    [Parameter()]
    [ValidateRange(0, [int]::MaxValue)]
    [int]
    $Group = 0
)

Begin 
{ 
    Set-StrictMode -Version latest 
    $count = 0
    $matched = @{}
    $regex = New-Object System.Text.RegularExpressions.Regex $Pattern,'Compiled'
}

Process 
{
    if ($psCmdlet.ParameterSetName -eq "Path")
    {
        # In the -Path (non-literal) case we may need to resolve a wildcarded path
        $resolvedPaths = @($Path | Resolve-Path | Convert-Path)
    }
    else 
    {
        # Must be -LiteralPath
        $resolvedPaths = @($LiteralPath | Convert-Path)
    }

    foreach ($rpath in $resolvedPaths) 
    {
        Write-Verbose "Processing $rpath"

        $stream = new-object System.IO.FileStream $rpath,'Open','Read','Read',4096
        $reader = new-object System.IO.StreamReader $stream
        try
        {
            while (($line = $reader.ReadLine()) -ne $null)
            {
                $matchColl = $regex.Matches($line)
                foreach ($match in $matchColl)
                {
                    $count++
                    $key = $match.Groups[$Group].Value
                    if ($matched.ContainsKey($key))
                    {
                        $matched[$key]++
                    }
                    else
                    {
                        $matched[$key] = 1;
                    }
                }
            }
        }
        finally
        {
            $reader.Close()
        }
    }
}

End
{
    new-object psobject -Property @{TotalCount = $count; Matched = $matched}
}
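On the multi-line point above: if the pattern only ever spans a few lines, one rough sketch (untested, and the window size of 3 is an arbitrary assumption) is to match against a small rolling buffer of lines inside the same read loop instead of a single line:

$window = new-object 'System.Collections.Generic.Queue[string]'
while (($line = $reader.ReadLine()) -ne $null)
{
    $window.Enqueue($line)
    if ($window.Count -gt 3) { [void]$window.Dequeue() }

    # Match against the joined window rather than the single line. The same
    # match can appear in consecutive windows, so the count bookkeeping would
    # need de-duplication to stay accurate.
    $matchColl = $regex.Matches(($window -join "`n"))
    # ... same per-match bookkeeping as in the single-line version ...
}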

I ran this against my IIS log dir (8.5 GB and ~1000 files) to find all the IP addresses in all the logs e.g.:

$r = ls . -r *.log | C:\Users\hillr\Search-File.ps1 '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'

This took 27 minutes on my system and found 54356330 matches:

$r.Matched.GetEnumerator() | sort Value -Descending | select -f 20


Name                           Value
----                           -----
xxx.140.113.47                 22459654
xxx.29.24.217                  13430575
xxx.29.24.216                  13321196
xxx.140.113.98                 4701131
xxx.40.30.254                  53724

Upvotes: 2

latkin

Reputation: 16812

Here's a function that will hopefully speed up and reduce the memory impact of the file processing part. It will return an object with two properties: the total count of lines matched, and a sorted array of the unique strings from the specified match group. (From your description it sounds like you don't really care about the count per string, just the string values themselves.)

function ProcessFile
{
   param(
      [Parameter(ValueFromPipeline = $true, Mandatory = $true)]
      [System.IO.FileInfo] $File,

      [Parameter(Mandatory = $true)]
      [string] $Pattern,

      [Parameter(Mandatory = $true)]
      [int] $Group
   )

   begin
   {
      $regex = new-object Regex @($pattern, 'Compiled')
      $set = new-object 'System.Collections.Generic.SortedDictionary[string, int]'
      $totalCount = 0
   }

   process
   {
      # Open the reader outside the try block so the finally always has a valid reader to close
      $reader = new-object IO.StreamReader $_.FullName
      try
      {
        while( ($line = $reader.ReadLine()) -ne $null)
        {
           $m = $regex.Match($line)
           if($m.Success)
           {
              $set[$m.Groups[$group].Value] = 1      
              $totalCount++
           }
        }
      }
      finally
      {
         $reader.Close()
      }
   }

   end
   {
      new-object psobject -prop @{TotalCount = $totalCount; Unique = ([string[]]$set.Keys)}
   }
}

You can use it like this:

$results = dir *.log | ProcessFile -Pattern 'stuff (capturegroup)' -Group 1
"Total matches: $($results.TotalCount)"
$results.Unique | Out-File .\Results.txt
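If you ever do want the count per string, a small assumed variation (not part of the function above) is to tally in the process block instead of just recording presence; the end block could then return $set itself as a value-to-count map:

if($m.Success)
{
   # tally occurrences per captured value instead of only marking presence
   $key = $m.Groups[$group].Value
   if($set.ContainsKey($key)) { $set[$key]++ } else { $set[$key] = 1 }
   $totalCount++
}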

Upvotes: 2
