Reputation: 843
I have a large text file, World.net (a Pajek file, but treat it as plain text), with the following content:
*Vertices 999999
1 "" 0.2931 0.2107 0.5000 empty
2 "" 0.2975 0.2214 0.5000
3 "" 0.3083 0.2258 0.5000
4 "" 0.3127 0.2406 0.5000
5 "" 0.3083 0.2514 0.5000
6 "" 0.3147 0.2578 0.5000
...
999999 "" 0.3103 0.2622 0.5000
*Edges :2 "World contours"
1 2 1
2 3 1
3 4 1
4 5 1
5 6 1
6 7 1
...
983725 8 1
I would like to split it into different .txt files, at the lines that start with
*[Something]
The [Something] should go into the file name like World_Vertices.txt and World_Edges.txt.
File contents should be the lines (1,2,3...), following each category (Vertices, Edges) from the original file, without the category name (which starts with *).
I have code that (kind of) works:
$filename = "World"
echo $pwd\"$filename.net"
$file = New-Object System.IO.StreamReader -Arg "$pwd\$filename.net"
while (($line = $file.ReadLine()) -ne $null) {
    If ($line -match "^\*\w+") {
        $newfile = -join("$filename ","$($line.Split('\*')[1]).txt")
        echo $newfile
    }
    Else {
        $line | Out-File -Append $newfile
    }
}
But this code is very slow. It takes 20 minutes on a 10 MB file, and I would like to be able to process a 4 GB file.
Hardware notes: the machine is good (i7 with a hybrid disk, 16 GB RAM), and I can install whichever .NET Framework version is needed to do the job.
Upvotes: 2
Views: 5701
Reputation: 23862
Writing to a file generally carries a lot of overhead, and the original code pays it once per line.
So keep the section data in memory until the section is complete, and then write the whole section at once:
$filename = "World"
echo "$pwd\$filename.net"
$file = New-Object System.IO.StreamReader -Arg "$pwd\$filename.net"
while (($line = $file.ReadLine()) -ne $null) {
    If ($line -match "^\*\w+") {
        # A new section starts: write out the section collected so far.
        If ($newfile) {$section | Out-File $newfile}
        $newfile = -join("$filename ","$($line.Split('\*')[1]).txt")
        echo $newfile
        $section = @()
    }
    Else {
        $section += $line
    }
}
# Write out the last section.
If ($newfile) {$section | Out-File $newfile}
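One caveat about the buffering above: += on a PowerShell array allocates a new array and copies every element on each append, so a section with hundreds of thousands of lines can become slow again. Below is a minimal sketch of the same idea with a growable list, assuming each section fits in memory. Split-PajekSections and its parameters are hypothetical names, not part of the original code, and the sketch names the output files after the first word of the header (World_Vertices.txt), which differs from the naming used above.

```powershell
# Hypothetical helper: buffers each section in a List[string]
# and writes it to disk in a single call.
function Split-PajekSections {
    param(
        [string]$Path,     # full path to the input file, e.g. C:\data\World.net
        [string]$OutDir    # full path to the directory for the output files
    )
    $base    = [System.IO.Path]::GetFileNameWithoutExtension($Path)
    $reader  = New-Object System.IO.StreamReader -ArgumentList $Path
    $section = New-Object 'System.Collections.Generic.List[string]'
    $newfile = $null
    try {
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($line.StartsWith('*')) {
                # Flush the previous section before starting a new one.
                if ($newfile) { [System.IO.File]::WriteAllLines($newfile, $section) }
                # Assumption: the first word after '*' is the section name.
                $name    = $line.Substring(1).Split(' ')[0]
                $newfile = Join-Path $OutDir "${base}_$name.txt"
                $section.Clear()
            }
            elseif ($newfile) {
                # List.Add appends in place; no array copy per line.
                $section.Add($line)
            }
        }
        # Write out the last section.
        if ($newfile) { [System.IO.File]::WriteAllLines($newfile, $section) }
    }
    finally {
        $reader.Dispose()
    }
}
```

For a 4 GB input a single section may not fit comfortably in memory; writing each line through a StreamWriter as it is read avoids the buffer entirely.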
Upvotes: 1
Reputation: 21069
Migrating OP's solution from the question to an answer:
After fixing a few bugs in the accepted answer, here is the final code I used (it may be helpful for anyone who wants to edit large Pajek files):
$filename = "World.net"
$file = New-Object System.IO.StreamReader -Arg "$pwd\$filename"
$writer = $null
$n = 0
while (($line = $file.ReadLine()) -ne $null) {
    If ($line.StartsWith("*")) {
        $n = 1
        $newfile = -join("$filename ","$($line.Split('\*')[1]).txt")
        echo $newfile
        if ($null -ne $writer) {
            $writer.Dispose()
        }
        $writer = New-Object System.IO.StreamWriter "$pwd\$newfile"
    }
    Else {
        If ($n -eq 0){
            $writer.WriteLine()
        }
        $writer.Write($line)
        $n = 0
    }
}
$writer.Dispose()
Upvotes: 0
Reputation: 17164
In general, calling .NET classes directly from PowerShell is usually the fastest option when performance is important. So using a StreamReader is already a good approach.
I changed your code to use a StreamWriter for writing to the output files:
$filename = "World"
echo "$pwd\$filename.net"
$file = New-Object System.IO.StreamReader -Arg "$pwd\$filename.net"
$writer = $null
while (($line = $file.ReadLine()) -ne $null) {
    If ($line -match "^\*\w+") {
        $newfile = -join("$filename ","$($line.Split('\*')[1]).txt")
        echo $newfile
        # Close the previous section's file before opening the next one.
        if ($null -ne $writer) {
            $writer.Dispose()
        }
        $writer = New-Object System.IO.StreamWriter "$pwd\$newfile"
    }
    Else {
        $writer.WriteLine($line)
    }
}
# Flush and close the last output file, then release the input file.
if ($null -ne $writer) {
    $writer.Dispose()
}
$file.Dispose()
Try it.
There are other ways to further improve your performance. For instance, you might skip the expensive regex check. Use this instead:
if ($line.StartsWith("*"))
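One more thing to watch: the file naming in the code above appears to put everything after the * into the file name, and a header such as *Edges :2 "World contours" contains characters that Windows does not allow in file names. Below is a small sketch of deriving the World_Vertices.txt style name that the question asks for. Get-SectionFileName is a hypothetical helper, not part of the original code, and it assumes the section name is the first run of word characters after the *.

```powershell
# Hypothetical helper: turns a header line such as '*Vertices 999999'
# into a file name such as 'World_Vertices.txt'.
function Get-SectionFileName {
    param(
        [string]$BaseName,   # e.g. 'World'
        [string]$Header      # a line that starts with '*'
    )
    # Capture the first run of word characters after the leading '*'.
    if ($Header -match '^\*(\w+)') {
        return "${BaseName}_$($Matches[1]).txt"
    }
    throw "Not a section header: $Header"
}

# Usage with the two headers from the question:
Get-SectionFileName -BaseName 'World' -Header '*Vertices 999999'
Get-SectionFileName -BaseName 'World' -Header '*Edges :2 "World contours"'
```

The regex here runs only on the few header lines, so the per-line check can stay on StartsWith without losing the speedup.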
Upvotes: 2