lexa
lexa

Reputation: 35

Creating a new csv from malformed csv (containing LF and CR characters in between a field)

I have a 4 column csv file that has one of its columns spread across multiple lines as seen below:

Number#,Scenes,OkOrFail,Time(ms)  
1,com.mine[scenario->GRADE_1:thsi is test is request from Eol   ],OK,3613
2,com.mine[scenario->GRADE_900:MSA Harvest all losses   ],OK,1325  
14,com.mine[scenario->GRADE_450:Great lamps Entity with xbars  
Entity used SplitEnt  
Model : silicon8],OK,930  
15,com.mine[scenario->GRADE_985:request picking from the transmitter

Entity used  silicon-B.E0004  
],FAIL,728  
16,com.mine[scenario->GRADE_120:single query from Receiver with  
],OK,1245

To see the unprintable characters in the original file, see below: bad file with breaks

I am looking to create a new csv from the original csv after removing the Number# column, extracting just the GRADE_ strings from the Scenes column (GRADE_1, GRADE_900, etc) and keeping the OkOrFail column.

So the new csv will look like this:

Scenes,OkOrFail,Time(ms)  
GRADE_1,OK,3613  
GRADE_900,OK,1325  
GRADE_450,OK,930  
GRADE_985,FAIL,728  
GRADE_120,OK,1245  

I thought constructing a regex to match the entire line and capturing what I want to output later would work. Here is the regex I have:

^[0-9]+,.+>([A-Z_0-9]+)[^,]+(,[A-Z]+.*)

Now, in Notepad++, this works fine BUT in actual powershell (using version 5.1) only the first two lines work ok.

Here is my code:

$origstring = "^[0-9]+,.+>([A-Z_0-9]+)[^,]+(,[A-Z]+.*)"
$testNameOnly = '$1'
$statusAndDuration = '$2'
$csvfile = "C:\small_bad.csv"

(Get-Content $csvfile) | % {
$_ = $_ -replace "Number#,", ''
$_ = $_ -replace $origstring, ($testNameOnly + $statusAndDuration)
Write-Host $_
}

And the output is:

PS C:\Windows\SysWOW64\WindowsPowerShell\v1.0> 
Scenes,OkOrFail,Time(ms)
GRADE_1,OK,3613
GRADE_900,OK,1325
14,com.mine[scenario->GRADE_450:Great lamps Entity with xbars
Entity used SplitEnt
Model : silicon8],OK,930
15,com.mine[scenario->GRADE_985:request picking from the transmitter

Entity used  silicon-B.E0004
],FAIL,728
16,com.mine[scenario->GRADE_120:single query from Receiver with
],OK,1245

Upvotes: 1

Views: 243

Answers (2)

BigJoe
BigJoe

Reputation: 131

Capture the entire content of the file (instead of a line by line approach) and then do your regex magic:

$x = get-content -raw -path Filename1 ; $x -replace $origstring, ($testNameOnly + $statusAndDuration) | set-content -path Filename2 

Upvotes: 1

TessellatingHeckler
TessellatingHeckler

Reputation: 29048

As Wiktor Stribiżew comments, Get-Content splits the file on newlines and outputs individual lines, but you need your regex to match over multiple lines, so it cannot work. Get-Content -Raw causes it to read the entire file into one string including newlines.

My approach is to match the com.mine[...] text and wrap it in double quotes, thus making the data a valid CSV which Import-Csv can process.

(Get-Content -Raw .\test.txt) -replace '(com\.mine\[[^\]]+\])', '"$1"' |
  ConvertFrom-Csv | Format-List

NB. Assumes there is no way to put a ] symbol inside the com.mine[ data here ] part.

Upvotes: 3

Related Questions