Reputation: 843

How to split a very large text file (4 GB) at a pre-defined string in PowerShell, and do it fast

I have a large text file, World.net (a Pajek file, but treat it as plain text), with the following content:

*Vertices 999999
    1 ""                                       0.2931    0.2107    0.5000 empty
    2 ""                                       0.2975    0.2214    0.5000
    3 ""                                       0.3083    0.2258    0.5000
    4 ""                                       0.3127    0.2406    0.5000
    5 ""                                       0.3083    0.2514    0.5000
    6 ""                                       0.3147    0.2578    0.5000
...
    999999 ""                                       0.3103    0.2622    0.5000
*Edges :2 "World contours"
    1     2 1 
    2     3 1 
    3     4 1 
    4     5 1 
    5     6 1 
    6     7 1 
...
    983725     8 1

I would like to split it into different .txt files, at the lines that start with

*[Something]

The [Something] should go into the file name like World_Vertices.txt and World_Edges.txt.

File contents should be the lines (1,2,3...), following each category (Vertices, Edges) from the original file, without the category name (which starts with *).

I have code that (kind of) works:

$filename = "World"
echo $pwd\"$filename.net"
$file = New-Object System.IO.StreamReader -Arg "$pwd\$filename.net"
while (($line = $file.ReadLine()) -ne $null) {
    If ($line -match "^\*\w+") {
        $newfile = -join("$filename ","$($line.Split('\*')[1]).txt")
        echo $newfile
    }
    Else {
        $line | Out-File -Append $newfile
    }
}

But this code is very slow: it takes 20 minutes on a 10 MB file, and I would like to be able to process a 4 GB file.

Hardware notes: the machine is good (i7 with a hybrid disk, 16 GB RAM), and I can install whichever .NET Framework version is needed to do the job.

Upvotes: 2

Answers (3)

iRon

Reputation: 23862

Every write to disk carries a lot of overhead, and Out-File -Append pays it once per line.
So keep the section data in memory until the section is complete, and then write the whole section at once:

$filename = "World"
echo $pwd\"$filename.net"
$file = New-Object System.IO.StreamReader -Arg "$pwd\$filename.net"
while (($line = $file.ReadLine()) -ne $null) {
    If ($line -match "^\*\w+") {
        If ($newfile) {$section | Out-File $newfile}
        $newfile = -join("$filename ","$($line.Split('\*')[1]).txt")
        echo $newfile
        $section = @()
    }
    Else {
        $Section += $line
    }
}
If ($newfile) {$section | Out-File $newfile}
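
One caveat with the loop above: `+=` on a PowerShell array allocates a new array and copies every element on each append, so the buffering itself can become the bottleneck on large sections. A minimal sketch of the same idea with a generic list follows. The function name `Split-BySectionBuffered` and its parameters are hypothetical, it assumes PowerShell 5 or later (for `::new()`), it assumes the section name is the first word after `*`, and it writes UTF-8 via `File.WriteAllLines` rather than using Out-File. Holding a whole section of a 4 GB file in memory may not be practical, so treat this as an illustration of the buffering approach rather than a recommendation.

```powershell
# Sketch only: Split-BySectionBuffered is a hypothetical helper name.
function Split-BySectionBuffered {
    param(
        [string]$Path,    # full path to the input file (.NET does not follow $pwd)
        [string]$OutDir   # existing directory for the output files
    )
    $base    = [System.IO.Path]::GetFileNameWithoutExtension($Path)
    $section = [System.Collections.Generic.List[string]]::new()
    $newfile = $null
    $reader  = [System.IO.StreamReader]::new($Path)
    try {
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($line.StartsWith('*')) {
                # Write the previous section in one call before starting a new one.
                if ($newfile) { [System.IO.File]::WriteAllLines($newfile, $section) }
                # Assumption: the section name is the first word after '*'.
                $name    = $line.Substring(1).Split(' ')[0]
                $newfile = Join-Path $OutDir "${base}_$name.txt"
                $section.Clear()
            }
            elseif ($newfile) {
                # List.Add is amortized O(1); array += copies the whole array.
                $section.Add($line)
            }
        }
        # Write the last section, which has no header after it.
        if ($newfile) { [System.IO.File]::WriteAllLines($newfile, $section) }
    }
    finally {
        $reader.Dispose()
    }
}
```

Called as `Split-BySectionBuffered -Path "$pwd\World.net" -OutDir $pwd`, this should produce World_Vertices.txt and World_Edges.txt for the sample file in the question.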

Upvotes: 1

TylerH

Reputation: 21069

Migrating OP's solution from the question to an answer:

After fixing a few bugs in the accepted answer, here is the final code I used (it may be helpful for anyone who wants to edit large Pajek files):

$filename = "World.net"
$file = New-Object System.IO.StreamReader -Arg "$pwd\$filename"
$writer = $null
$n = 0
while (($line = $file.ReadLine()) -ne $null) {
   If ($line.StartsWith("*")) {
       $n = 1
       $newfile = -join("$filename ","$($line.Split('\*')[1]).txt")
       echo $newfile
       if ($null -ne $writer) {
           $writer.Dispose()
       }
       $writer = New-Object System.IO.StreamWriter "$pwd\$newfile"
   }
   Else {
       If ($n -eq 0){
           $writer.WriteLine()
       }
       $writer.Write($line)
       $n = 0
   }
}
$writer.Dispose()

Upvotes: 0

marsze

Reputation: 17164

In general, calling .NET classes directly from PowerShell is usually the fastest option when performance matters, so using a StreamReader is already a good approach.

I changed your code to use a StreamWriter for writing to the output files:

$filename = "World"
echo "$pwd\$filename.net"
$file = New-Object System.IO.StreamReader -Arg "$pwd\$filename.net"
$writer = $null
while (($line = $file.ReadLine()) -ne $null) {
    If ($line -match "^\*\w+") {
        $newfile = -join("$filename ","$($line.Split('\*')[1]).txt")
        echo $newfile
        if ($null -ne $writer) {
            $writer.Dispose()
        }
        $writer = New-Object System.IO.StreamWriter "$pwd\$newfile"
    }
    Else {
        $writer.WriteLine($line)
    }
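}
# Flush and close the last output file, then release the input file.
if ($null -ne $writer) {
    $writer.Dispose()
}
$file.Dispose()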

Try it.

There are other ways to improve performance further. For instance, you can replace the comparatively expensive regex check with a plain prefix test:

if ($line.StartsWith("*"))
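
Putting the pieces together, the streaming approach might be sketched as below. This is an illustration under stated assumptions, not the exact code anyone in this thread ran: the function name `Split-BySectionStreaming` and its parameters are hypothetical, it assumes PowerShell 5 or later (for `::new()`), and it assumes the section name is the first word after `*`, which yields the World_Vertices.txt style of name the question asks for. The `try`/`finally` makes sure the last writer is flushed even if the loop fails partway.

```powershell
# Sketch only: Split-BySectionStreaming is a hypothetical helper name.
function Split-BySectionStreaming {
    param(
        [string]$Path,    # full path to the input file (.NET does not follow $pwd)
        [string]$OutDir   # existing directory for the output files
    )
    $base   = [System.IO.Path]::GetFileNameWithoutExtension($Path)
    $reader = [System.IO.StreamReader]::new($Path)
    $writer = $null
    try {
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($line.StartsWith('*')) {
                # Close the previous output file before opening the next one.
                if ($null -ne $writer) { $writer.Dispose() }
                # Assumption: the section name is the first word after '*'.
                $name   = $line.Substring(1).Split(' ')[0]
                $writer = [System.IO.StreamWriter]::new((Join-Path $OutDir "${base}_$name.txt"))
            }
            elseif ($null -ne $writer) {
                # Lines that appear before the first header are skipped.
                $writer.WriteLine($line)
            }
        }
    }
    finally {
        # Dispose flushes the buffered tail of the last section to disk.
        if ($null -ne $writer) { $writer.Dispose() }
        $reader.Dispose()
    }
}
```

Usage would look like `Split-BySectionStreaming -Path "$pwd\World.net" -OutDir $pwd`. Only one line is held in memory at a time, so memory use should stay flat regardless of the input size.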

Upvotes: 2
