Reputation: 2099
I have a pipe-delimited text file. The file contains "records" of various types. I want to modify certain columns for each record type. For simplicity, let's say there are 3 record types: A, B, and C. A has 3 columns, B has 4 columns, and C has 5 columns. For example, we have:
A|stuff|more_stuff
B|123|other|x
C|something|456|stuff|more_stuff
B|78903|stuff|x
A|1|more_stuff
I want to append the prefix "P" to all desired columns. For A, the desired column is 2. For B, the desired column is 3. For C, the desired column is 4.
So, I want the output to look like:
A|Pstuff|more_stuff
B|123|Pother|x
C|something|456|Pstuff|more_stuff
B|78903|Pstuff|x
A|P1|more_stuff
I need to do this in PowerShell. The file could be very large, so I'm thinking about going with the .NET File class. If it were a simple string replacement, I would do something like:
$content = [System.IO.File]::ReadAllText("H:\test_modify_contents.txt").Replace("replace_text","something_else")
[System.IO.File]::WriteAllText("H:\output_file.txt", $content)
But it's not that simple in my particular situation, so I'm not even sure whether ReadAllText and WriteAllText are the best solution. Any ideas on how to do this?
Upvotes: 0
Views: 4142
Reputation: 575
Given the large input file, I would not use either ReadAllText or Get-Content; they actually read the entire file into memory.
Consider using something along the lines of
$filename = ".\input2.csv"
$outfilename = ".\output2.csv"

function ProcessFile($inputfilename, $outputfilename)
{
    # Stream the input one line at a time; only the current record is ever in memory
    $reader = [System.IO.File]::OpenText($inputfilename)
    $writer = New-Object System.IO.StreamWriter $outputfilename
    $record = $reader.ReadLine()
    while ($null -ne $record)
    {
        # Group 1 captures everything up to and including the pipe before the
        # target column; the replacement re-emits it with "P" injected after it
        $writer.WriteLine(($record -replace '^(A\||B\|[^\|]+\||C\|[^\|]+\|[^\|]+\|)(.*)$', '$1P$2'))
        $record = $reader.ReadLine()
    }
    $reader.Close()
    $reader.Dispose()
    $writer.Close()
    $writer.Dispose()
}

ProcessFile $filename $outfilename
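One possible hardening, not part of the original answer: wrap the loop in try/finally so both handles are released even if a line throws during the replace. A minimal sketch of the same loop:

$reader = [System.IO.File]::OpenText($inputfilename)
$writer = New-Object System.IO.StreamWriter $outputfilename
try
{
    # A parenthesized assignment yields the assigned value, so the loop
    # condition reads and tests each line in a single step
    while ($null -ne ($record = $reader.ReadLine()))
    {
        $writer.WriteLine(($record -replace '^(A\||B\|[^\|]+\||C\|[^\|]+\|[^\|]+\|)(.*)$', '$1P$2'))
    }
}
finally
{
    # Dispose() also closes the underlying streams
    $reader.Dispose()
    $writer.Dispose()
}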
EDIT: After testing all the suggestions on this page, I have borrowed the regex from Dave Sexton and this is the fastest implementation: it processes a 1 GB+ file in 175 seconds. All the other implementations are significantly slower on large input files.
Upvotes: 1
Reputation: 11188
If your file sizes are large, then reading the complete file contents at once using Import-Csv or ReadAll is probably not a good idea. I would use the Get-Content cmdlet with the -ReadCount parameter, which streams the file one row at a time, and then use a regex for the processing. Something like this:
Get-Content your_in_file.txt -ReadCount 1 | % {
    $_ -replace '^(A\||B\|[^\|]+\||C\|[^\|]+\|[^\|]+\|)(.*)$', '$1P$2'
} | Set-Content your_out_file.txt
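To see what the regex does: the alternation in group 1 matches the record-type letter plus every column before the target column ("A|", "B|col2|", or "C|col2|col3|"), group 2 captures the rest, and the replacement '$1P$2' re-joins them with the P injected in between. A quick sanity check against the sample records from the question:

# Group 1 eats everything up to the target column; "P" lands before group 2
$pattern = '^(A\||B\|[^\|]+\||C\|[^\|]+\|[^\|]+\|)(.*)$'
'A|stuff|more_stuff',
'B|123|other|x',
'C|something|456|stuff|more_stuff' | % { $_ -replace $pattern, '$1P$2' }
# A|Pstuff|more_stuff
# B|123|Pother|x
# C|something|456|Pstuff|more_stuff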
EDIT: This version should output faster:
$d = Get-Date
Get-Content input.txt -ReadCount 1000 | % {
    # With -ReadCount 1000, $_ is a batch of up to 1000 lines; process each one
    $_ | % {
        $_ -replace '^(A\||B\|[^\|]+\||C\|[^\|]+\|[^\|]+\|)(.*)$', '$1P$2'
    } | Add-Content output.txt
}
(New-TimeSpan $d (Get-Date)).TotalMilliseconds
For me this processed 50k rows in 350 milliseconds. You can probably get more speed by tweaking the -ReadCount value to find the ideal batch size.
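A minimal harness for that tuning, as a sketch; input.txt and the output file names here are placeholders:

# Try a few batch sizes and report the elapsed time for each
foreach ($rc in 100, 1000, 10000) {
    $sw = [System.Diagnostics.Stopwatch]::StartNew()
    Get-Content input.txt -ReadCount $rc | % {
        # -replace works element-wise on the batch array, so no inner loop is needed
        $_ -replace '^(A\||B\|[^\|]+\||C\|[^\|]+\|[^\|]+\|)(.*)$', '$1P$2'
    } | Set-Content "output_${rc}.txt"
    $sw.Stop()
    '{0,6}: {1} ms' -f $rc, $sw.ElapsedMilliseconds
}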
Upvotes: 1
Reputation: 1716
I would use ConvertFrom-Csv so you can handle each line as an object. In this code I added a header, mainly for readability; the header row is cut out of the output again on the last line anyway:
# $input is a reserved automatic variable in PowerShell, so use different names
$infile  = "H:\test_modify_contents.txt"
$outfile = "H:\output_file.txt"

$data = Get-Content -Path $infile | ConvertFrom-Csv -Delimiter '|' -Header 'Column1','Column2','Column3','Column4','Column5'

$data | % {
    If ($_.Column5) {
        # type C: 5 columns, prefix column 4
        $_.Column4 = "P$($_.Column4)"
    } ElseIf ($_.Column4) {
        # type B: 4 columns, prefix column 3
        $_.Column3 = "P$($_.Column3)"
    } Else {
        # type A: 3 columns, prefix column 2
        $_.Column2 = "P$($_.Column2)"
    }
}

# Re-serialize; Select-Object -Skip 1 drops the header row added above
$data | Select Column1,Column2,Column3,Column4,Column5 | ConvertTo-Csv -Delimiter '|' -NoTypeInformation | Select-Object -Skip 1 | Set-Content -Path $outfile
It does quote each field and add extra trailing | delimiters for the type A and B lines. Output:
"A"|"Pstuff"|"more_stuff"||
"B"|"123"|"Pother"|"x"|
"C"|"something"|"456"|"Pstuff"|"more_stuff"
"B"|"78903"|"Pstuff"|"x"|
"A"|"P1"|"more_stuff"||
Upvotes: 1