Predrag Vasić

Reputation: 351

Sort very large text file in PowerShell

I have standard Apache log files, between 500 MB and 2 GB in size. I need to sort the lines in them (each line starts with a date in the form yyyy-MM-dd hh:mm:ss, so no special treatment is necessary for sorting).

The simplest and most obvious thing that comes to mind is

 Get-Content unsorted.txt | sort | get-unique > sorted.txt

I am guessing (without having tried it) that doing this using Get-Content would take forever on my 1 GB files. I don't quite know my way around System.IO.StreamReader, but I'm curious whether an efficient solution could be put together using it.

Thanks to anyone who might have a more efficient idea.

[edit]

I tried this subsequently, and it took a very long time: about 10 minutes for 400 MB.

Upvotes: 11

Views: 31369

Answers (7)

pico

Reputation: 1910

With the function below you can sort a file like this:

PS> PowerSort -SrcFile C:\windows\win.ini

function PowerSort {
    param(
        [string]$SrcFile = "",
        [string]$DstFile = "",
        [switch]$Force
    )

    if ($SrcFile -eq "") {
        write-host "USAGE: PowerSort -SrcFile (srcfile)  [-DstFile (dstfile)] [-Force]"
        return 0;
    }
    else {
        $SrcFileFullPath = Resolve-Path $SrcFile -ErrorAction SilentlyContinue -ErrorVariable _frperror        
        if (-not($SrcFileFullPath)) {
            throw "Source file not found: $SrcFile";
        }
    }

    [Collections.Generic.List[string]]$lines = [System.IO.File]::ReadAllLines($SrcFileFullPath)
    
    $lines.Sort();

    # Write Sorted File to Pipe
    if ($DstFile -eq "") {
        foreach ($line in $lines) {
            write-output $line
        }           
    }
    
    # Write Sorted File to File
    else {
        $pipe_enable = 0;
        $DstFileFullPath = Resolve-Path $DstFile -ErrorAction SilentlyContinue -ErrorVariable ev

        # Destination File doesn't exist        
        if (-not($DstFileFullPath)) {
           $DstFileFullPath = $ev[0].TargetObject       
        }
        
        # Destination Exists and -force not specified.
        elseif (-not $Force) {
            throw "Destination file already exists: ${DstFile}  (using -Force Flag to overwrite)"           
        }       
        
        write-host "Writing-File: $DstFile"
        [System.IO.File]::WriteAllLines($DstFileFullPath, $lines)
    }
    return
}

Upvotes: 0

js2010

Reputation: 27516

There doesn't seem to be a great way to do it in PowerShell, including [IO.File]::ReadLines(), but with either the native Windows sort.exe or the GNU sort.exe, run from cmd.exe, 30 million random numbers can be sorted in about 5 minutes with around 1 GB of RAM. GNU sort automatically breaks the work up into temp files to save RAM. Both commands have options to start the sort at a certain character column, and GNU sort can merge already-sorted files. See external sorting.

30 million line test file:

& { foreach ($i in 1..300kb) { get-random } } | set-content file.txt

And then in cmd:

copy file.txt+file.txt file2.txt
copy file2.txt+file2.txt file3.txt
copy file3.txt+file3.txt file4.txt
copy file4.txt+file4.txt file5.txt
copy file5.txt+file5.txt file6.txt
copy file6.txt+file6.txt file7.txt
copy file7.txt+file7.txt file8.txt

With GNU sort.exe from http://gnuwin32.sourceforge.net/packages/coreutils.htm (don't forget the dependency DLLs, libiconv2.dll and libintl3.dll), within cmd.exe:

.\sort.exe < file8.txt > filesorted.txt

Or windows sort.exe within cmd.exe:

sort.exe < file8.txt > filesorted.txt
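
The column-start and merge options mentioned above look roughly like this (just a sketch: the output and chunk file names are made up for illustration, and the GNU lines assume the coreutils sort.exe is the one being invoked):

rem Windows sort.exe: /+n starts comparisons at character column n, /O names the output file
rem (for the OP's logs, /+21 would skip the 20-character "yyyy-MM-dd hh:mm:ss " prefix and sort on the message text instead)
sort.exe /+21 unsorted.txt /O sortedbymessage.txt

rem GNU sort.exe: -m merges files that are each already sorted, -o names the output file
sort.exe -m sortedA.txt sortedB.txt -o merged.txt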

Upvotes: 0

Carsten

Reputation: 2211

"Get-Content" can be faster than you think. Check this code-snippet in addition to the above solution:

$hs = New-Object System.Collections.Generic.HashSet[string]   # collects the unique lines
foreach ($block in (Get-Content $file -ReadCount 100)) {       # $file is the path to the log file
    foreach ($line in $block) { [void]$hs.Add($line) }
}
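
To finish the job (a minimal sketch, not part of the original snippet), the unique lines collected in $hs can then be sorted and written out the same way as in n0rd's answer:

$ls = New-Object System.Collections.Generic.List[string] $hs   # copy the unique lines into a sortable list
$ls.Sort()
[System.IO.File]::WriteAllLines("$pwd\sorted.txt", $ls)        # write the sorted result to disk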

Upvotes: 0

Robert Cotterman

Reputation: 2268

I've grown to hate this part of Windows PowerShell; it is a memory hog on these larger files. One trick is to stream the lines instead of loading the whole file:

[System.IO.File]::ReadLines('file.txt') | sort -u | out-file file2.txt -encoding ascii

Another trick, seriously, is to just use Linux.

cat file.txt | sort -u > output.txt

Linux is so insanely fast at this that it makes me wonder what the heck Microsoft is thinking with this setup.

It may not be feasible in all cases, I understand, but if you have a Linux machine, you can copy 500 MB to it, sort and unique it, and copy it back in under a couple of minutes.
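
If the Linux box is reachable over SSH, the round trip described above is roughly this (a sketch only; the user and host names are hypothetical):

# Copy the file over, sort and de-duplicate it remotely, then copy the result back
scp file.txt user@linuxbox:/tmp/
ssh user@linuxbox "sort -u /tmp/file.txt > /tmp/output.txt"
scp user@linuxbox:/tmp/output.txt .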

Upvotes: 4

E.Z. Hart

Reputation: 5757

(Edited to be more clear based on n0rd's comments)

It might be a memory issue. Since you're loading the entire file into memory to sort it (and adding the overhead of the pipe into Sort-Object and the pipe into Get-Unique), it's possible that you're hitting the memory limits of the machine and forcing it to page to disk, which will slow things down a lot. One thing you might consider is splitting the logs up before sorting them, and then splicing them back together.

This probably won't match your format exactly, but if I've got a large log file for, say, 8/16/2012 which spans several hours, I can split it up into a different file for each hour using something like this:

for($i=0; $i -le 23; $i++){ Get-Content .\u_ex120816.log | ? { $_ -match "^2012-08-16 $('{0:D2}' -f $i):" } | Set-Content -Path "$i.log" }

This is creating a regular expression for each hour of that day and dumping all the matching log entries into a smaller log file named by the hour (e.g. 16.log, 17.log).

Then I can run your process of sorting and getting unique entries on a much smaller subsets, which should run a lot faster:

for($i=0; $i -le 23; $i++){ Get-Content "$i.log" | sort | get-unique > "${i}sorted.txt" }

And then you can splice them back together.
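
Because every entry in one hour sorts before every entry in the next (the timestamp is the line prefix), "splicing" is just concatenating the hourly results in order; a minimal sketch using the file names from above:

# Concatenate 0sorted.txt through 23sorted.txt in hour order into one fully sorted file
Get-Content (0..23 | ForEach-Object { "${_}sorted.txt" }) | Set-Content -Path sorted.txt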

Depending on the frequency of the logs, it might make more sense to split them by day, or minute; the main thing is to get them into more manageable chunks for sorting.

Again, this only makes sense if you're hitting the memory limits of the machine (or if Sort-Object is using a really inefficient algorithm).

Upvotes: 0

n0rd

Reputation: 12670

Get-Content is terribly inefficient for reading large files, and Sort-Object is not very fast either.

Let's set up a base line:

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$c = Get-Content .\log3.txt -Encoding Ascii
$sw.Stop();
Write-Output ("Reading took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$s = $c | Sort-Object;
$sw.Stop();
Write-Output ("Sorting took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u = $s | Get-Unique
$sw.Stop();
Write-Output ("uniq took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u | Out-File 'result.txt' -Encoding ascii
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);

With a 40 MB file having 1.6 million lines (made of 100k unique lines repeated 16 times) this script produces the following output on my machine:

Reading took 00:02:16.5768663
Sorting took 00:02:04.0416976
uniq took 00:01:41.4630661
saving took 00:00:37.1630663

Totally unimpressive: more than 6 minutes to sort a tiny file. Every step can be improved a lot. Let's use a StreamReader to read the file line by line into a HashSet, which removes duplicates, then copy the data to a List and sort it there, then use a StreamWriter to dump the results back.

$hs = new-object System.Collections.Generic.HashSet[string]
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$reader = [System.IO.File]::OpenText("D:\log3.txt")
try {
    while (($line = $reader.ReadLine()) -ne $null)
    {
        $t = $hs.Add($line)
    }
}
finally {
    $reader.Close()
}
$sw.Stop();
Write-Output ("read-uniq took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$ls = new-object system.collections.generic.List[string] $hs;
$ls.Sort();
$sw.Stop();
Write-Output ("sorting took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
try
{
    $f = New-Object System.IO.StreamWriter "d:\result2.txt";
    foreach ($s in $ls)
    {
        $f.WriteLine($s);
    }
}
finally
{
    $f.Close();
}
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);

This script produces:

read-uniq took 00:00:32.2225181
sorting took 00:00:00.2378838
saving took 00:00:01.0724802

On the same input file it runs more than 10 times faster. I am still surprised, though, that it takes 30 seconds to read the file from disk.

Upvotes: 9

Lance U. Matthews

Reputation: 16612

If each line of the log is prefixed with a timestamp, and the log messages don't contain embedded newlines (which would require special handling), I think it would take less memory and execution time to convert the timestamp from [String] to [DateTime] before sorting. The following assumes each log entry is of the format yyyy-MM-dd HH:mm:ss: <Message> (note that the HH format specifier is used for a 24-hour clock):

Get-Content unsorted.txt `
    | ForEach-Object {
        # Ignore empty lines; can substitute with [String]::IsNullOrWhitespace($_) on PowerShell 3.0 and above
        if (-not [String]::IsNullOrEmpty($_))
        {
            # Split into at most two fields, even if the message itself contains ': '
            [String[]] $fields = $_ -split ': ', 2;

            return New-Object -TypeName 'PSObject' -Property @{
                Timestamp = [DateTime] $fields[0];
                Message   = $fields[1];
            };
        }
    } | Sort-Object -Property 'Timestamp', 'Message';

If you are processing the input file for interactive display purposes you can pipe the above into Out-GridView or Format-Table to view the results. If you need to save the sorted results you can pipe the above into the following:

    | ForEach-Object {
        # Reconstruct the log entry format of the input file
        return '{0:yyyy-MM-dd HH:mm:ss}: {1}' -f $_.Timestamp, $_.Message;
    } `
    | Out-File -Encoding 'UTF8' -FilePath 'sorted.txt';

Upvotes: 0
