Ajni K

Reputation: 13

Text File into csv

I have a text file that I would like to convert into a CSV file. Here is an example of its layout:

Column number 1 text
=========================
Column number 2 text
More column number 2 text
Col nr 2
More..
<empty line>
<empty line>
<empty line>
Column number 1 text
===============
Col nr 2
Col nr 2
Col nr 2
<empty line>
<empty line>
<empty line>

and so on....

I have tried a lot of PowerShell commands to do this conversion, but I haven't had any success. Could someone help me? It doesn't have to be a PowerShell script.

Thank you!

Upvotes: 1

Views: 296

Answers (1)

mklement0

Reputation: 440431

Here's a PSv3+ solution that is concise, though perhaps not easy to understand:

Get-Content -Raw file.txt |
  Select-String -AllMatches '(.+)\r?\n=+\r?\n([\s\S]*?)(?:\r?\n){3}' | ForEach-Object {
    $_.Matches | ForEach-Object {
      [pscustomobject] @{
        Col1 = $_.Groups[1].Value
        Col2 = $_.Groups[2].Value
      }
    }
  } | Export-Csv -NoTypeInformation out.csv
  • Note that the input file is read as a whole up front - this may not be an option with large input files.

  • Select-String is used to extract the paragraphs of interest with a regex (regular expression).

    • Capture groups ((...)) inside the regex extract the column-1 and column-2 values.
    • .+ matches all characters on a nonempty line (. means: any character other than \n (LF), and + means: one or more repetitions); in the context of the overall regex, this will capture the column-1 value.
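    • \r?\n matches both CRLF newlines (Windows) and LF-only newlines (Unix), to be safe. Note, however, that . also matches \r (CR), so with CRLF input the greedy .+ captures a trailing \r as part of the column-1 value; apply .TrimEnd() to the captured value if that matters.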
    • =+ matches one or more adjacent = characters; i.e., it matches the separator line in the input.
    • [\s\S] is a trick to match any character including \n (LF); by default, . only matches characters except \n; nongreedy repetition symbol *? means that matching should stop as soon as something matches the remainder of the expression; in the context of the overall regex, this will capture the column-2 value.
    • (?:...) is a noncapturing group; we need (...) only to apply quantifier {3} to subexpression \r?\n (exactly 3 repetitions), but we're not interested in capturing what the subexpression matched; in the context of the overall regex, this matches the 3 empty lines at the end of each block (without capturing them).
  • The ForEach-Object call iterates over all matches and constructs a custom object with property names Col1 and Col2 from the capture-group values, which is implicitly output.

  • Export-Csv then exports the resulting objects in CSV format to output file out.csv; note that the default encoding used by Export-Csv on Windows PowerShell is ASCII, so use the -Encoding parameter as needed.
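
If reading the whole file up front is not an option, a line-by-line variant along the following lines may work instead. This is only a sketch, not part of the solution above: the function name ConvertFrom-BlockText is hypothetical, and it assumes that every block consists of a title line, a line made up only of = characters, and then the body lines, with blank lines between blocks. Unlike the regex approach, it joins the body lines with LF regardless of the newline style of the input file.

```powershell
# Hypothetical helper (name and layout assumptions are mine, not from the answer above).
function ConvertFrom-BlockText {
  param([Parameter(Mandatory)] [string] $Path)

  $col1 = $null      # title of the block currently being collected
  $pending = $null   # most recent line, held back until we know it is not a title
  $body = New-Object 'System.Collections.Generic.List[string]'

  # Outputs the block collected so far, if there is one.
  $emit = {
    if ($null -ne $col1) {
      [pscustomobject] @{
        Col1 = $col1
        Col2 = ($body -join "`n").TrimEnd()  # drops the blank lines between blocks
      }
    }
  }

  # switch -File reads the file one line at a time.
  switch -Regex -File $Path {
    '^=+\s*$' {
      # Separator line: the held-back line was the title of a new block.
      & $emit
      $col1 = $pending; $pending = $null; $body.Clear()
    }
    default {
      # Ordinary line: the previously held-back line was body text after all.
      if ($null -ne $pending) { $body.Add($pending) }
      $pending = $_
    }
  }

  # End of file: flush the last held-back line and the last block.
  if ($null -ne $pending) { $body.Add($pending) }
  & $emit
}
```

It could then be used in place of the first two pipeline stages, e.g. ConvertFrom-BlockText file.txt | Export-Csv -NoTypeInformation out.csv.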


With your sample input, out.csv ends up containing the following:

"Col1","Col2"
"Column number 1 text","Column number 2 text
More column number 2 text
Col nr 2
More.."
"Column number 1 text","Col nr 2
Col nr 2
Col nr 2"

Note how the Col2 values are multiline values - however, Import-Csv handles such values fine (as long as they're double-quoted).
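
A quick way to convince yourself of this is a small round trip. The following is a minimal sketch using made-up sample data and a temporary file name of my choosing (multiline-demo.csv):

```powershell
# Write one object whose Col2 value spans two lines, then read it back.
$csv = Join-Path ([IO.Path]::GetTempPath()) 'multiline-demo.csv'
[pscustomobject] @{ Col1 = 'Title'; Col2 = "line 1`nline 2" } |
  Export-Csv -NoTypeInformation $csv

$row = Import-Csv $csv

# Count the lines in the re-imported Col2 value.
($row.Col2 -split '\r?\n').Count
```

If the multiline value survives the round trip, the reported line count matches the number of lines that were written.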

Upvotes: 1
