sftamps

Reputation: 17

Parsing a large text file (1 GB / 1.3 million lines) and efficiently splitting its records into files

I have a 1 GB text file and my PowerShell code is taking 5 hours to split it based on record names.

"STD|AAAA|X|dummy" "dummy"
"STD|BBBB|X|dummy" "dummy"
"STD|CCCC|X|dummy" "dummy"
"STD|AAAA|X|dummy" "dummy"

The expected result is three text files (AAAA.txt, BBBB.txt, CCCC.txt), each containing the lines that match its name.

$data = get-content "$input_path"

foreach ($line in $data) {
    $matches  = [regex]::Match($line, 'STD\|(?<TheFilename>[^\|`"]+)[\|`"]+')
    $FirstLvl = $matches.Groups['TheFilename']

    if ($FirstLvl.Value -ne "") {
        $FullPath = Join-Path $ParentPath -ChildPath $FirstLvl.Value
        $line | Out-File -FilePath "$FullPath" -Append
    }
}

Upvotes: 1

Views: 358

Answers (1)

Ansgar Wiechers

Reputation: 200193

First of all, do not read the entire input file into memory; use a pipeline instead. Split the lines at the pipes to extract the file basename rather than using a regular expression match. Also, are there actually lines that don't have a field for the basename? If not, checking whether $FirstLvl is empty is a waste of resources.

Get-Content $input_path | ForEach-Object {
    $FirstLvl = $_.Split('|')[1]
    $_ | Add-Content "${ParentPath}\${FirstLvl}.txt"
}
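As a quick check of what the split yields, here is one of the sample lines from the question run through the same expression. This is only an illustration; the variable $fields is introduced here and is not part of the code above.

```powershell
# Sample record taken from the question.
$line = '"STD|AAAA|X|dummy" "dummy"'

# Splitting at the pipes gives four fields:
#   '"STD', 'AAAA', 'X', 'dummy" "dummy"'
$fields   = $line.Split('|')
$FirstLvl = $fields[1]

$FirstLvl   # AAAA
```

The second field (index 1) is the record name, so no regular expression is needed as long as every line follows this layout.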

If you require better performance than that, you need to work with .NET methods.

$reader  = [IO.StreamReader]$input_path
$writers = @{}

while ($reader.Peek() -ge 0) {
    $line     = $reader.ReadLine()
    $FirstLvl = $line.Split('|')[1]

    if (-not $writers.Contains($FirstLvl)) {
        $writers[$FirstLvl] = [IO.StreamWriter]"${ParentPath}\${FirstLvl}.txt"
    }

    $writers[$FirstLvl].WriteLine($line)
}

# Close() flushes any buffered output and releases the file handle,
# so a separate Dispose() call is not needed.
$reader.Close()
foreach ($writer in $writers.Values) {
    $writer.Close()
}

By keeping one writer per output file in a hashtable, you avoid having to re-open the output files for every line.
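If the loop throws partway through, the open writers are never closed. A minimal sketch of the same hashtable-of-writers idea wrapped in try/finally follows. The function name Split-RecordsByName and its parameter names are hypothetical, the ::new() syntax assumes PowerShell 5 or later, and skipping lines without a second field is an assumption about what you want for malformed input.

```powershell
# Hypothetical helper; function and parameter names are illustrative only.
function Split-RecordsByName {
    param(
        [Parameter(Mandatory)] [string] $InputPath,
        [Parameter(Mandatory)] [string] $ParentPath
    )

    # .NET resolves relative paths against the process working directory,
    # not the current PowerShell location, so make both paths absolute first.
    $InputPath  = (Resolve-Path -LiteralPath $InputPath).ProviderPath
    $ParentPath = (Resolve-Path -LiteralPath $ParentPath).ProviderPath

    $reader  = [IO.StreamReader]::new($InputPath)
    $writers = @{}
    try {
        # ReadLine() returns $null at end of file; empty lines come back as ''.
        while ($null -ne ($line = $reader.ReadLine())) {
            $fields = $line.Split('|')
            if ($fields.Length -lt 2) { continue }   # assumption: skip lines without a name field

            $name = $fields[1]
            if (-not $writers.ContainsKey($name)) {
                # Opens (and overwrites) <ParentPath>\<name>.txt once per distinct name.
                $path = Join-Path $ParentPath "$name.txt"
                $writers[$name] = [IO.StreamWriter]::new($path)
            }
            $writers[$name].WriteLine($line)
        }
    }
    finally {
        # Runs even if the loop throws, so buffered output is flushed and handles are released.
        $reader.Dispose()
        foreach ($writer in $writers.Values) { $writer.Dispose() }
    }
}

# Usage (paths are placeholders):
# Split-RecordsByName -InputPath $input_path -ParentPath $ParentPath
```

Note that every distinct record name keeps one file handle open until the end, which is fine for a handful of names but worth keeping in mind if the input contains a very large number of them.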

Upvotes: 2
