skyline01

Reputation: 2099

How to modify contents of a pipe-delimited text file with PowerShell

I have a pipe-delimited text file. The file contains "records" of various types. I want to modify certain columns for each record type. For simplicity, let's say there are 3 record types: A, B, and C. A has 3 columns, B has 4 columns, and C has 5 columns. For example, we have:

A|stuff|more_stuff
B|123|other|x
C|something|456|stuff|more_stuff
B|78903|stuff|x
A|1|more_stuff

I want to prepend the prefix "P" to one column of each record. For A, the desired column is 2; for B, it is 3; for C, it is 4.

So, I want the output to look like:

A|Pstuff|more_stuff
B|123|Pother|x
C|something|456|Pstuff|more_stuff
B|78903|Pstuff|x
A|P1|more_stuff

I need to do this in PowerShell. The file could be very large, so I'm thinking about using the .NET File class. If it were a simple string replacement, I would do something like:

$content = [System.IO.File]::ReadAllText("H:\test_modify_contents.txt").Replace("replace_text","something_else")
[System.IO.File]::WriteAllText("H:\output_file.txt", $content)

But it's not so simple in my situation, so I'm not even sure that ReadAllText and WriteAllText are the best solution. Any ideas on how to do this?

Upvotes: 0

Views: 4142

Answers (3)

Jower

Reputation: 575

Given the large input file, I would not use either ReadAllText or Get-Content. ReadAllText reads the entire file into memory, and so does Get-Content when its output is collected into a variable.

Consider using something along the lines of:

$filename = ".\input2.csv"
$outfilename = ".\output2.csv"

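function ProcessFile($inputfilename, $outputfilename)
{
    # .NET resolves relative paths against the process working directory, not the
    # current PowerShell location, so convert both paths to full paths first.
    $inputfilename = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($inputfilename)
    $outputfilename = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($outputfilename)

    $reader = [System.IO.File]::OpenText($inputfilename)
    try
    {
        $writer = New-Object System.IO.StreamWriter $outputfilename
        try
        {
            # Stream one line at a time so the whole file is never held in memory.
            $record = $reader.ReadLine()
            while ($null -ne $record)
            {
                $writer.WriteLine(($record -replace '^(A\||B\|[^\|]+\||C\|[^\|]+\|[^\|]+\|)(.*)$', '$1P$2'))
                $record = $reader.ReadLine()
            }
        }
        finally
        {
            # Dispose flushes and closes the writer, even if an error occurred.
            $writer.Dispose()
        }
    }
    finally
    {
        $reader.Dispose()
    }
}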

ProcessFile $filename $outfilename

EDIT: After testing all the suggestions on this page, I borrowed the regex from Dave Sexton, and this is the fastest implementation. It processes a 1 GB+ file in 175 seconds. All other implementations are significantly slower on large input files.

Upvotes: 1

Dave Sexton

Reputation: 11188

If your files are large, then reading the complete file contents at once using Import-Csv or ReadAllText is probably not a good idea. I would use the Get-Content cmdlet with the -ReadCount parameter, which streams the file one row at a time, and then use a regex for the processing. Something like this:

Get-Content your_in_file.txt -ReadCount 1 | % {
  $_ -replace '^(A\||B\|[^\|]+\||C\|[^\|]+\|[^\|]+\|)(.*)$', '$1P$2'
} | Set-Content your_out_file.txt
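
To see what the pattern does, here is a minimal sketch that runs it against the sample records from the question. The variable names ($pattern, $sample, $result) are only for illustration. Each alternative in the first group captures the fields that come before the target column for one record type, and the second group captures the rest of the line.

```powershell
# Pattern copied from the snippet above.
$pattern = '^(A\||B\|[^\|]+\||C\|[^\|]+\|[^\|]+\|)(.*)$'

# Sample records from the question.
$sample = @(
    'A|stuff|more_stuff'
    'B|123|other|x'
    'C|something|456|stuff|more_stuff'
    'B|78903|stuff|x'
    'A|1|more_stuff'
)

# -replace applied to an array returns one result per element.
# '$1P$2' re-emits the leading fields, then a literal P, then the remainder.
$result = $sample -replace $pattern, '$1P$2'
$result
```

One thing to keep in mind: [^\|]+ requires at least one character, so a B or C record with an empty field before the target column would not match and would pass through unchanged.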

EDIT: This version should output faster:

$d = Get-Date
Get-Content input.txt -ReadCount 1000 | % {
    $_ | % {
        $_ -replace '^(A\||B\|[^\|]+\||C\|[^\|]+\|[^\|]+\|)(.*)$', '$1P$2'
    } | Add-Content output.txt 
}
# TotalMilliseconds is the whole elapsed time; Milliseconds is only the 0-999 component.
(New-TimeSpan $d (Get-Date)).TotalMilliseconds

For me this processed 50k rows in 350 milliseconds. You can probably get more speed by tuning the -ReadCount value to find the ideal batch size. Note that Add-Content appends, so delete any existing output.txt before rerunning.
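
As a rough illustration of what -ReadCount changes, the sketch below writes five lines to a throwaway temp file and reads them back in batches of two. The temp file, the simplified A-only pattern, and the variable names are assumptions made for the demo, not part of the answer above.

```powershell
# Build a small throwaway file with five lines (the temp path is only for this demo).
$tmp = [System.IO.Path]::GetTempFileName()
1..5 | ForEach-Object { "A|row$_|x" } | Set-Content -Path $tmp

# With -ReadCount 2, each pipeline object is a batch of up to two lines.
$batchSizes = Get-Content -Path $tmp -ReadCount 2 | ForEach-Object { $_.Count }

# -replace accepts an array on its left side and returns one result per line,
# so a whole batch can be transformed without an inner loop.
$converted = Get-Content -Path $tmp -ReadCount 2 | ForEach-Object {
    $_ -replace '^(A\|)(.*)$', '$1P$2'
}

Remove-Item -Path $tmp
$batchSizes   # batch sizes: 2, 2, 1
$converted
```

Because -replace already works on a whole batch, the inner `$_ | % { ... }` loop is not strictly required. Whether dropping it is measurably faster would need testing on the real file.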

Upvotes: 1

xXhRQ8sD2L7Z

Reputation: 1716

I would use ConvertFrom-Csv so you can check each line as an object. In this code I added a header, mainly for readability. The header is cut out of the output on the last line anyway:

# $input is an automatic variable in PowerShell, so use a different name for the path.
$inputFile = "H:\test_modify_contents.txt"
$output = "H:\output_file.txt"
$data = Get-Content -Path $inputFile | ConvertFrom-Csv -Delimiter '|' -Header 'Column1','Column2','Column3','Column4','Column5'

$data | % {
   If ($_.Column5) {
      #type C:
      $_.Column4 = "P$($_.Column4)"
   } ElseIf ($_.Column4) {
      #type B:
      $_.Column3 = "P$($_.Column3)"
   } Else {
      #type A:
      $_.Column2 = "P$($_.Column2)"
   }
}

$data | Select Column1,Column2,Column3,Column4,Column5 | ConvertTo-Csv -Delimiter '|' -NoTypeInformation | Select-Object -Skip 1 | Set-Content -Path $output

It does wrap every field in quotes and add extra trailing | delimiters to the type A and B lines. Output:

"A"|"Pstuff"|"more_stuff"||
"B"|"123"|"Pother"|"x"|
"C"|"something"|"456"|"Pstuff"|"more_stuff"
"B"|"78903"|"Pstuff"|"x"|
"A"|"P1"|"more_stuff"||

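If the quotes and trailing delimiters are a problem, one possible alternative is to split each line on the delimiter, prefix the field chosen by record type, and join the fields back together. The sketch below is only an illustration under that assumption. The lookup table $targetIndex and the helper name Add-FieldPrefix are hypothetical, not something from the answers above.

```powershell
# Hypothetical lookup table: record type -> zero-based index of the field to prefix.
$targetIndex = @{ A = 1; B = 2; C = 3 }

# Hypothetical helper: prefixes the target field of one pipe-delimited line.
function Add-FieldPrefix {
    param([string]$Line, [string]$Prefix = 'P')

    $fields = $Line.Split([char]'|')
    $type = $fields[0]

    # Only touch known record types that actually have the target field.
    if ($targetIndex.ContainsKey($type) -and $fields.Length -gt $targetIndex[$type]) {
        $i = $targetIndex[$type]
        $fields[$i] = $Prefix + $fields[$i]
    }

    # Re-join with the original delimiter; no quotes or padding are added.
    $fields -join '|'
}

$sample = 'A|stuff|more_stuff', 'B|123|other|x', 'C|something|456|stuff|more_stuff'
$converted = $sample | ForEach-Object { Add-FieldPrefix -Line $_ }
$converted
```

For a real file, the same helper could be fed from a streaming Get-Content pipeline or a StreamReader loop. Lines with an unknown record type pass through unchanged.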
Upvotes: 1
