Reputation: 13
I have a text file that I would like to convert into a CSV file.
For better understanding, here is an example:
Column number 1 text
=========================
Column number 2 text
More column number 2 text
Col nr 2
More..
<empty line>
<empty line>
<empty line>
Column number 1 text
===============
Col nr 2
Col nr 2
Col nr 2
<empty line>
<empty line>
<empty line>
and so on....
I have tried a lot of PowerShell commands to do this conversion, but I haven't had any success. Could someone help me? It doesn't have to be a PowerShell script.
Thank you!
Upvotes: 1
Views: 296
Reputation: 440431
Here's a PSv3+ solution that is concise, though perhaps not easy to understand:
Get-Content -Raw file.txt |
  Select-String -AllMatches '(.+)\r?\n=+\r?\n([\s\S]*?)(?:\r?\n){3}' | ForEach-Object {
    $_.Matches | ForEach-Object {
      [pscustomobject] @{
        # In .NET regexes . also matches CR, so strip a trailing CR left over from CRLF input.
        Col1 = $_.Groups[1].Value.TrimEnd("`r")
        Col2 = $_.Groups[2].Value
      }
    }
  } | Export-Csv -NoTypeInformation out.csv
Note that the input file is read as a whole up front - this may not be an option with large input files.
Select-String is used to extract the paragraphs of interest with a regex (regular expression).
Capture groups, written (...), inside the regex extract the column-1 and column-2 values.
.+ matches all characters on a nonempty line (. means: any character other than \n (LF), and + means: one or more repetitions); in the context of the overall regex, this captures the column-1 value.
\r?\n matches both CRLF newlines (Windows) and LF-only newlines (Unix), to be safe.
=+ matches one or more adjacent = characters; i.e., it matches the separator line in the input.
[\s\S] is a trick to match any character including \n (LF); by default, . only matches characters other than \n. The nongreedy repetition symbol *? means that matching should stop as soon as something matches the remainder of the expression; in the context of the overall regex, this captures the column-2 value.
(?:...) is a noncapturing group; we need the parentheses only to apply quantifier {3} to subexpression \r?\n (exactly 3 repetitions), but we're not interested in capturing what the subexpression matched; in the context of the overall regex, this matches the 3 empty lines at the end of each block (without capturing them).
The ForEach-Object call iterates over all matches and constructs a custom object with property names Col1 and Col2 from the capture-group values, which is implicitly output.
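To see what the two capture groups pick up, the same regex can be tried on a small in-memory string. This is only a sketch: the sample text and the variable names ($sample, $pattern, $rows) are made up for illustration, and the sample uses LF-only newlines.

```powershell
# Hypothetical sample standing in for file.txt: two blocks, each followed by 3 empty lines.
$sample = "Title A`n=====`nline 1`nline 2`n`n`n`nTitle B`n===`nonly line`n`n`n`n"

# Same regex as in the command above.
$pattern = '(.+)\r?\n=+\r?\n([\s\S]*?)(?:\r?\n){3}'

# [regex]::Matches returns every match; Groups[1] and Groups[2] hold the two captures.
$rows = foreach ($m in [regex]::Matches($sample, $pattern)) {
    [pscustomobject] @{
        Col1 = $m.Groups[1].Value
        Col2 = $m.Groups[2].Value
    }
}

# Show both objects with their (possibly multiline) property values.
$rows | Format-List
```

Each block yields one object: Col1 holds the line above the = separator, and Col2 holds everything between the separator and the empty lines.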
Export-Csv then exports the resulting objects in CSV format to output file out.csv; note that the default encoding used by Export-Csv on Windows PowerShell is ASCII, so use the -Encoding parameter as needed.
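If the text contains non-ASCII characters, the ASCII default would replace them with "?". A minimal sketch of passing -Encoding explicitly follows; the sample row and the temp-file name out-utf8.csv are hypothetical, and UTF8 is just one possible choice.

```powershell
# Hypothetical row; [char]0xE9 is "é", built this way so the script file's own encoding does not matter.
$row = [pscustomobject] @{ Col1 = 'Caf' + [char]0xE9; Col2 = 'plain' }

# Hypothetical output path in the temp directory.
$csvPath = Join-Path ([IO.Path]::GetTempPath()) 'out-utf8.csv'

# -Encoding UTF8 keeps the non-ASCII character intact in the output file.
$row | Export-Csv -NoTypeInformation -Encoding UTF8 $csvPath

# Read the file back with the same encoding to check that the value survived.
$roundTrip = Import-Csv -Encoding UTF8 $csvPath
$roundTrip.Col1
```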
With your sample input, out.csv ends up containing the following:
"Col1","Col2"
"Column number 1 text","Column number 2 text
More column number 2 text
Col nr 2
More.."
"Column number 1 text","Col nr 2
Col nr 2
Col nr 2"
Note how the Col2 values are multiline values - however, Import-Csv handles such values fine (as long as they're double-quoted).
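A quick way to convince yourself of this is a round trip through a temporary file. The sketch below uses a made-up row and a hypothetical file name (multiline-demo.csv) rather than the real out.csv.

```powershell
# Hypothetical sample row with a multiline Col2 value (not the real out.csv).
$tmp = Join-Path ([IO.Path]::GetTempPath()) 'multiline-demo.csv'
[pscustomobject] @{ Col1 = 'Title A'; Col2 = "line 1`nline 2" } |
    Export-Csv -NoTypeInformation $tmp

# Import-Csv restores the embedded newline because the field is double-quoted.
$back = Import-Csv $tmp

# Split the multiline value back into its individual lines and count them.
$lineCount = ($back.Col2 -split '\r?\n').Count
"$($back.Col1): $lineCount line(s)"
```

Splitting on \r?\n after the import gives you the individual Col2 lines again if you need to process them separately.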
Upvotes: 1