jujulalu
jujulalu

Reputation: 97

Extract columns from pipe delimited text file and output using powershell

I need to extract 2 columns (column 147 and column 148) from a huge pipe delimited file (100 MB-3GB). I cannot open in excel or if I do it takes forever. Once the columns are extracted, I need to output the top N rows into another text file but I need to format them so they can be put into SQL so I can query a table. However, there can be many concatenated fields, separated by commas, in the 2 extracted columns (LOGICAL_KEY_CONCAT_FIELD_CLM, LOGICAL_KEY_CONCAT_FIELD_CLM2) and their format can vary from file to file. I need this to be a repeating process.

For example, pipe delimited file looks like:

|||…|LOGICAL_KEY_CONCAT_FIELD_CLM|LOGICAL_KEY_CONCAT_FIELD_CLM2|||||…
|||…|CLM,KEY,1|CLM,FINANCIAL,KEY,1,2018-11-30|||…
|||…|CLM,KEY,2|CLM,FINANCIAL,KEY,2,2018-11-30|||…
|||…|CLM,KEY,3|CLM,FINANCIAL,KEY,3,2018-11-30|||…
.
.
.

OUTPUT:

(LOGICAL_KEY_CONCAT_FIELD_CLM = 'CLM,KEY,1' AND 
LOGICAL_KEY_CONCAT_FIELD_CLM2 = 'CLM,FINANCIAL,KEY,1,2018-11-30')
OR (LOGICAL_KEY_CONCAT_FIELD_CLM = 'CLM,KEY,2' AND 
LOGICAL_KEY_CONCAT_FIELD_CLM2 = 'CLM,FINANCIAL,KEY,2,2018-11-30')
OR (LOGICAL_KEY_CONCAT_FIELD_CLM = 'CLM,KEY,3' AND 
LOGICAL_KEY_CONCAT_FIELD_CLM2 = 'CLM,FINANCIAL,KEY,3,2018-11-30')

Basically, I want to run a query using the data from the columns in the file and I want to ensure the combination of LOGICAL_KEY_CONCAT_FIELD_CLM and LOGICAL_KEY_CONCAT_FIELD_CLM2 are in the table. I don’t need SELECT * FROM Table WHERE because the Table varies depending on the incoming source.

I am using the following code but it doesn’t print the columns in the output file, just the headers:

Get-Content "\\LocationOfFile\CLAIM_20190103T17053920.txt" | select-object 
LOGICAL_KEY_CONCAT_FIELD_CLM,LOGICAL_KEY_CONCAT_FIELD_CLM2 -First 10 | Out- 
File "P:\PDS_QA\TestFile\Output.txt"

Upvotes: 1

Views: 2683

Answers (2)

marsze
marsze

Reputation: 17035

This is probably the fastest option(?)

Since your file is very large, it uses some .NET streams for optimal performance:

$outstream = New-Object System.IO.StreamWriter "P:\PDS_QA\TestFile\Output.txt"
try {
    $outstream.WriteLine("SELECT * FROM Table WHERE 1=0")
    $firstLine = $true
    foreach ($line in [System.IO.File]::ReadLines("\\LocationOfFile\CLAIM_20190103T17053920.txt")) {
        if ($firstLine) {
            # skip the header of the file
            $firstLine = $false
            continue
        }
        $values = $line.Split("|")
        # (-1 because I assume your column numbers are one-based)
        $clm, $clm2 = $values[146, 147]
        $line = "OR (LOGICAL_KEY_CONCAT_FIELD_CLM = '{0}' AND LOGICAL_KEY_CONCAT_FIELD_CLM2 = '{1}')" -f $clm, $clm2
        $outstream.WriteLine($line)
    }
}
finally {
    $outstream.Dispose()
}

Upvotes: 2

marsze
marsze

Reputation: 17035

Alternative solution, with pure built-in PowerShell cmdlets. I can't say how good the performance is for large files. Probably worse than my other answer. You should try it out.

Also, this requires the input CSV to have a header with unique names for every column.

$infile = "\\LocationOfFile\CLAIM_20190103T17053920.txt"
$outfile = "P:\PDS_QA\TestFile\Output.txt"
"SELECT * FROM Table WHERE 1=0" | Out-File $outfile
Import-Csv $infile -Delimiter "|" | foreach {
    "OR (LOGICAL_KEY_CONCAT_FIELD_CLM = '{0}' AND LOGICAL_KEY_CONCAT_FIELD_CLM2 = '{1}')" -f (
    $_.LOGICAL_KEY_CONCAT_FIELD_CLM,
    $_.LOGICAL_KEY_CONCAT_FIELD_CLM2)
} | Out-File $outfile -Append

Upvotes: 1

Related Questions