scobi

Reputation: 14558

How to process a file in PowerShell line-by-line as a stream

I'm working with some multi-gigabyte text files and want to do some stream processing on them using PowerShell. It's simple stuff, just parsing each line and pulling out some data, then storing it in a database.

Unfortunately, get-content | %{ whatever($_) } appears to keep the entire set of lines in memory at this stage of the pipe. It's also surprisingly slow, taking a very long time to actually read it all in.

So my question is two parts:

  1. How can I make it process the stream line by line and not keep the entire thing buffered in memory? I would like to avoid using up several gigs of RAM for this purpose.
  2. How can I make it run faster? Iterating over the output of get-content in PowerShell appears to be 100x slower than a C# script.

I'm hoping there's something dumb I'm doing here, like missing a -LineBufferSize parameter or something...

Upvotes: 102

Views: 293064

Answers (4)

Steve

Reputation: 347

For those interested...

A bit of perspective on this, since I had to work with very large files.

Below are the results on a 39 GB XML file containing 56 million lines/records. The lookup text is a 10-digit number.

1) GC -rc 1000 | % -match -> 183 seconds
2) GC -rc 100 | % -match  -> 182 seconds
3) GC -rc 1000 | % -like  -> 840 seconds
4) GC -rc 100 | % -like   -> 840 seconds
5) sls -simple            -> 730 seconds
6) sls                    -> 180 seconds (sls default uses regex, but pattern in my case is passed as literal text)
7) Switch -file -regex    -> 258 seconds
8) IO.File.Readline       -> 250 seconds

Options 1 and 6 are the clear winners, but I have gone with option 1 (written out in full below).
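
For reference, option 1 written out in full would look roughly like this (the file path and the 10-digit pattern are placeholders; with -ReadCount, $_ is an array of lines, so the -match below acts as a filter and returns only the matching elements):

# Sketch of option 1: read the file in batches of 1000 lines and let
# -match filter each batch; only matching lines continue down the pipe.
Get-Content .\huge.xml -ReadCount 1000 |
    ForEach-Object { $_ -match '1234567890' }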

PS. The test was conducted on a Windows Server 2012 R2 server with PS 5.1. The server has 16 vCPUs and 64 GB of memory, but only 1 CPU was utilised for this test, and the PS process memory footprint stayed minimal because all of the approaches above use very little memory.

Upvotes: 0

Roman Kuzmin

Reputation: 42073

If you really are about to work on multi-gigabyte text files then do not use PowerShell. Even if you find a way to read the file faster, processing a huge number of lines will be slow in PowerShell anyway, and you cannot avoid this. Even simple loops are expensive; say, for 10 million iterations (quite realistic in your case) we have:

# "empty" loop: takes 10 seconds
measure-command { for($i=0; $i -lt 10000000; ++$i) {} }

# "simple" job, just output: takes 20 seconds
measure-command { for($i=0; $i -lt 10000000; ++$i) { $i } }

# "more real job": 107 seconds
measure-command { for($i=0; $i -lt 10000000; ++$i) { $i.ToString() -match '1' } }

UPDATE: If you are still not scared, then try the .NET reader:

$reader = [System.IO.File]::OpenText("my.log")
try {
    for() {
        $line = $reader.ReadLine()
        if ($line -eq $null) { break }
        # process the line
        $line
    }
}
finally {
    $reader.Close()
}

UPDATE 2

There are comments about possibly better / shorter code. There is nothing wrong with the original code using for, and it is not pseudo-code. But a shorter (shortest?) variant of the reading loop is:

$reader = [System.IO.File]::OpenText("my.log")
while($null -ne ($line = $reader.ReadLine())) {
    $line
}
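
For the scenario in the question (parse each line, pull out some data, store it), the loop body would replace the bare $line output. A minimal sketch, assuming tab-delimited lines and a hypothetical Save-Record function standing in for the database write:

$reader = [System.IO.File]::OpenText("my.log")
while($null -ne ($line = $reader.ReadLine())) {
    # assumption: tab-delimited lines; Save-Record is a hypothetical
    # stand-in for the actual database insert
    $fields = $line.Split("`t")
    Save-Record -Key $fields[0] -Value $fields[1]
}
$reader.Close()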

Upvotes: 101

Despertar

Reputation: 22392

System.IO.File.ReadLines() is perfect for this scenario. It returns all the lines of a file, but lets you begin iterating over the lines immediately, which means it does not have to store the entire contents in memory.

Requires .NET 4.0 or higher.

foreach ($line in [System.IO.File]::ReadLines($filename)) {
    # do something with $line
}

http://msdn.microsoft.com/en-us/library/dd383503.aspx
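
One caveat worth noting: .NET methods such as ReadLines resolve relative paths against the process working directory, which is not necessarily the current PowerShell location, so it is safer to pass a full path. For example, converting it first:

# Convert-Path expands a (possibly relative) PowerShell path to a full
# filesystem path before handing it to the .NET API.
foreach ($line in [System.IO.File]::ReadLines((Convert-Path $filename))) {
    # do something with $line
}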

Upvotes: 53

Chris Blydenstein

Reputation: 235

If you want to use straight PowerShell, check out the code below.

# Note: this assigns the entire file to $content before the loop starts
$content = Get-Content C:\Users\You\Documents\test.txt
foreach ($line in $content)
{
    Write-Host $line
}
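
For the multi-gigabyte files in the question it may be worth piping instead of assigning, so that lines stream through the pipeline rather than being collected in $content first; same cmdlets, just rearranged:

# Streaming variant: each line is processed as it is read.
Get-Content C:\Users\You\Documents\test.txt | ForEach-Object {
    Write-Host $_
}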

Upvotes: 1
