PS get-content high memory usage - Is there a more efficient way to filter a file?

I am using get-content to read a largish file (252 MB), but when I do, the powershell process proceeds to consume almost 10 GB of memory. Is this normal behavior?

The resulting array has just shy of 6 million items, which doesn't seem to be remotely in line with the amount of memory being used.

Maybe I'm just going about this the wrong way entirely.

I want to write the line that matches a string and the subsequent line to a new text file.

$mytext = get-content $inpath
$search = "*tacos*"
$myindex = 0..($mytext.count - 1) | Where {$mytext[$_] -like $search}
$outtext = @()
foreach ($i in $myindex){
    $outtext = $outtext + $mytext[$i] + $mytext[$i+1]
}
$outtext | out-file -filepath $outpath

Performance Testing Results

I took a performance sample for different scripts based on different answers here.

My original script

(highly sensitive to the number of lines that get written out)

Select-String with no Get-Content (adapted from whatever's answer below)

Select-String -path $inpath -pattern $search -Context 0,1 -SimpleMatch | Out-File $outpath

Note that the processing time only grows by a factor of ~4 for a 10x increase in input. The more data you try to process at once, the better this solution becomes relative to the others.

Eliminating the array resize (from Mathias)

Using the pipeline (from Chris Dent)
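
For anyone wanting to reproduce the comparison, here is a minimal sketch of one way such a sample can be taken with Measure-Command (shown with the Select-String variant; $inpath and $outpath are assumed to be defined as above):

# Hypothetical timing harness - wrap each candidate script in Measure-Command
# and compare TotalSeconds. Note that -SimpleMatch takes a plain substring
# ("tacos"), not a -like wildcard pattern.
$elapsed = Measure-Command {
    Select-String -Path $inpath -Pattern "tacos" -Context 0,1 -SimpleMatch |
        Out-File $outpath
}
$elapsed.TotalSeconds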

Upvotes: 3

Views: 4234

Answers (3)

whatever

Reputation: 891

Another option is Select-String:

$search = "tacos"
Get-Content $inpath | Select-String $search -Context 0,1 | Out-File $OutputFile -Append

However, this will produce slightly changed output:

match
following line

will turn into

> match
  following line

If you want the exact lines from the file:

Get-Content $inpath | Select-String $search -Context 0,1 | ForEach-Object {
    $_.Line | Out-File $OutputFile -Append
    $_.Context.PostContext | Out-File $OutputFile -Append
}

Btw: Get-Content gets kinda slow once files get really big. Once that happens it might be better to do:

$TMPVar = Get-Content $inpath -Readcount 0
$TMPVar | Select-String....

This will make Get-Content read the entire file at once instead of line by line, which is much faster but needs a bit more memory than piping it directly into the next cmdlet.
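
Combined with the exact-line version above, that could look something like this (same variables as before, just shown end to end):

# Read the whole file into memory in one go (-ReadCount 0 returns a single
# string array instead of emitting one line at a time), then search it.
$TMPVar = Get-Content $inpath -ReadCount 0

$TMPVar | Select-String $search -Context 0,1 | ForEach-Object {
    $_.Line | Out-File $OutputFile -Append
    $_.Context.PostContext | Out-File $OutputFile -Append
}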

Upvotes: 2

Chris Dent

Reputation: 4250

The pipeline is your friend. There's no advantage to be gained from your indexing approach; it only makes the run take longer and pulls more into memory.

This gets the line you're searching for, plus the one line of context you need (from the example). Nothing is loaded into memory except the items that match your search plus that one line.

$getNext = $false
$outtext = Get-Content $inPath | ForEach-Object {
    if ($_ -like $search) {
        $_
        $getNext = $true
    }
    elseif ($getNext) { # on the iteration after a match, emit the following line
        $_
        $getNext = $false
    }
}
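
If you don't actually need the intermediate variable, the same loop can stay fully streaming and write straight to the output file, something like:

$getNext = $false
Get-Content $inPath | ForEach-Object {
    if ($_ -like $search) {
        $_                  # emit the matching line
        $getNext = $true
    }
    elseif ($getNext) {
        $_                  # emit the line that follows the match
        $getNext = $false
    }
} | Out-File $outPath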

Upvotes: 2

Mathias R. Jessen

Reputation: 174690

> process proceeds to consume almost 10 GB of memory. [...] The resulting array has just shy of 6 million items, which doesn't seem to be remotely in line with the amount of memory being used.

Get-Content against a file of 6 million lines results in 6 million string objects - and a string object costs more than just the characters themselves: .NET stores text as UTF-16 (roughly double the on-disk size for ASCII content), and each string also carries an object header and length field of a few dozen bytes on top of that.

That would only account for about 5-10% of what you're seeing though - the real problem is this construct:

$outtext = @() # this
foreach ($i in $myindex){
    $outtext = $outtext + $mytext[$i] + $mytext[$i+1] # and this
}

Arrays in .NET are fixed-size, so every time you "grow" the array with + like that, a brand new array has to be allocated and the existing contents copied into it. The bigger $outtext gets, the more expensive each append becomes, which is why the original script is so sensitive to the number of lines that get written out.

Change it to:

$outtext = foreach ($i in $myindex){
    $mytext[$i],$mytext[$i+1]
}
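
If you really do need to build the result incrementally (say, inside a more complicated loop), a resizable collection such as List[string] is a common alternative that avoids the repeated copying; a quick sketch, not part of the fix above:

# A generic List grows its internal buffer geometrically, so each Add is
# amortized O(1) instead of copying the whole array on every append.
$outtext = New-Object System.Collections.Generic.List[string]
foreach ($i in $myindex) {
    $outtext.Add($mytext[$i])
    $outtext.Add($mytext[$i + 1])
}
$outtext | Out-File -FilePath $outpath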

Upvotes: 4
