Reputation: 97
I need to extract 2 columns (column 147 and column 148) from a huge pipe delimited file (100 MB-3GB). I cannot open in excel or if I do it takes forever. Once the columns are extracted, I need to output the top N rows into another text file but I need to format them so they can be put into SQL so I can query a table. However, there can be many concatenated fields, separated by commas, in the 2 extracted columns (LOGICAL_KEY_CONCAT_FIELD_CLM, LOGICAL_KEY_CONCAT_FIELD_CLM2) and their format can vary from file to file. I need this to be a repeating process.
For example, pipe delimited file looks like:
|||…|LOGICAL_KEY_CONCAT_FIELD_CLM|LOGICAL_KEY_CONCAT_FIELD_CLM2|||||…
|||…|CLM,KEY,1|CLM,FINANCIAL,KEY,1,2018-11-30|||…
|||…|CLM,KEY,2|CLM,FINANCIAL,KEY,2,2018-11-30|||…
|||…|CLM,KEY,3|CLM,FINANCIAL,KEY,3,2018-11-30|||…
.
.
.
OUTPUT:
(LOGICAL_KEY_CONCAT_FIELD_CLM = 'CLM,KEY,1' AND
LOGICAL_KEY_CONCAT_FIELD_CLM2 = 'CLM,FINANCIAL,KEY,1,2018-11-30')
OR (LOGICAL_KEY_CONCAT_FIELD_CLM = 'CLM,KEY,2' AND
LOGICAL_KEY_CONCAT_FIELD_CLM2 = 'CLM,FINANCIAL,KEY,2,2018-11-30')
OR (LOGICAL_KEY_CONCAT_FIELD_CLM = 'CLM,KEY,3' AND
LOGICAL_KEY_CONCAT_FIELD_CLM2 = 'CLM,FINANCIAL,KEY,3,2018-11-30')
Basically, I want to run a query using the data from the columns in the file and I want to ensure the combination of LOGICAL_KEY_CONCAT_FIELD_CLM and LOGICAL_KEY_CONCAT_FIELD_CLM2 are in the table. I don’t need SELECT * FROM Table WHERE because the Table varies depending on the incoming source.
I am using the following code but it doesn’t print the columns in the output file, just the headers:
Get-Content "\\LocationOfFile\CLAIM_20190103T17053920.txt" | select-object
LOGICAL_KEY_CONCAT_FIELD_CLM,LOGICAL_KEY_CONCAT_FIELD_CLM2 -First 10 | Out-
File "P:\PDS_QA\TestFile\Output.txt"
Upvotes: 1
Views: 2683
Reputation: 17035
This is probably the fastest option(?)
Since your file is very large, it uses some .NET streams for optimal performance:
$outstream = New-Object System.IO.StreamWriter "P:\PDS_QA\TestFile\Output.txt"
try {
$outstream.WriteLine("SELECT * FROM Table WHERE 1=0")
$firstLine = $true
foreach ($line in [System.IO.File]::ReadLines("\\LocationOfFile\CLAIM_20190103T17053920.txt")) {
if ($firstLine) {
# skip the header of the file
$firstLine = $false
continue
}
$values = $line.Split("|")
# (-1 because I assume your column numbers are one-based)
$clm, $clm2 = $values[146, 147]
$line = "OR (LOGICAL_KEY_CONCAT_FIELD_CLM = '{0}' AND LOGICAL_KEY_CONCAT_FIELD_CLM2 = '{1}')" -f $clm, $clm2
$outstream.WriteLine($line)
}
}
finally {
$outstream.Dispose()
}
Upvotes: 2
Reputation: 17035
Alternative solution, with pure built-in PowerShell cmdlets. I can't say how good the performance is for large files. Probably worse than my other answer. You should try it out.
Also, this requires the input CSV to have a header with unique names for every column.
$infile = "\\LocationOfFile\CLAIM_20190103T17053920.txt"
$outfile = "P:\PDS_QA\TestFile\Output.txt"
"SELECT * FROM Table WHERE 1=0" | Out-File $outfile
Import-Csv $infile -Delimiter "|" | foreach {
"OR (LOGICAL_KEY_CONCAT_FIELD_CLM = '{0}' AND LOGICAL_KEY_CONCAT_FIELD_CLM2 = '{1}')" -f (
$_.LOGICAL_KEY_CONCAT_FIELD_CLM,
$_.LOGICAL_KEY_CONCAT_FIELD_CLM2)
} | Out-File $outfile -Append
Upvotes: 1