maxxyme

Reputation: 2334

How to split a file (with sed) into numerous files according to a value found on each line?

I have several Company_***.csv files (although the separator is a tab, not a comma, so they should really be *.tsv, but never mind), each of which contains a header plus numerous data lines, e.g.

1stHeader   2ndHeader   DateHeader  OtherHeaders...
111111111   SOME STRING 2020-08-01  OTHER STRINGS..
222222222   ANOT STRING 2020-08-02  OTHER STRINGS..

I have to split them according to the 3rd column, which is a date.

Each file should be named like e.g. Company_2020_08_01.csv, Company_2020_08_02.csv and so on, and should contain the same header on the 1st line, followed by the matching rows.

At first I thought about saving (once) the header in a single file e.g.

sed -n '1w Company_header.csv' Company_*.csv

then parsing the files with a pattern for the date (hence the headers would be skipped) e.g.

sed -n '/\t2020-[01][0-9]-[0-3][0-9]\t/w somefilename.csv' Company_*.csv

... and at last, insert the (missing) header in each generated file.

But I'm stuck at step 2: I can't find how to dynamically generate the file name expected by the w command, nor how to capture the date in the search pattern (apparently this is just an address, not a search-and-replace field as in the s/regexp/replacement/[flags] command, so you can't use capturing groups ( ) in there).

So I wonder if this is actually doable with sed? Or should I look at other tools, e.g. awk?

Disclaimer: I'm quite a n00b with these commands so I'm just learning/starting from scratch...

Upvotes: 0

Views: 50

Answers (1)

choroba

Reputation: 241988

Perl to the rescue!

perl -e 'while (<>) {
    $h = $_, next if $. == 1;
    $. = 0 if eof;
    @c = split /\t/;
    open my $out, ">>", "Company_" . $c[2] =~ tr/-/_/r . ".csv" or die $!;
    print {$out} $h unless tell $out;
    print {$out} $_;
}' -- Company_*.csv
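  • The diamond operator <> in scalar context reads one line from the input files.
  • The first line of each file (the header) is stored in the variable $h. $. holds the current input line number; it isn't reset between files read through <>, so it's set back to 0 at eof to make the next file's header line 1 again.
  • split populates the @c array with the column values of each line.
  • $c[2] contains the date; tr with the /r flag returns a copy with dashes translated to underscores, from which the file name is built. open opens the file for appending.
  • print prints the header if the file is still empty (tell returns the current position in the file, which is 0 for an empty file)
  • and prints the current line, too.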

Note that it only appends to the files, so don't forget to delete any output files before running the script again.
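
Since the question also mentions awk, here is a rough sketch of the same idea in awk, wrapped in a shell function. It is only an illustration under a few assumptions: the function name split_by_date is made up, the date is always in the 3rd tab-separated column, and the number of distinct dates is small enough for awk to keep all output files open at once.

```shell
# Hypothetical helper (the name is not from the question or the answer):
# split tab-separated input files into Company_YYYY_MM_DD.csv files.
split_by_date() {
    awk -F '\t' '
        FNR == 1 { header = $0; next }   # first line of each input file is its header
        {
            date = $3
            gsub(/-/, "_", date)         # 2020-08-01 -> 2020_08_01
            out = "Company_" date ".csv"
            if (!(out in seen)) {        # first row for this date in this run
                print header > out       # ">" truncates the file on first open only
                seen[out] = 1
            }
            print > out                  # later writes go to the same open file
        }
    ' "$@"
}
```

Usage would be something like split_by_date Company_*.csv, run where the glob matches only the input files. Unlike the Perl version, this sketch overwrites each output file on its first write in a run instead of appending to it, so leftovers from a previous run are replaced rather than extended.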

Upvotes: 1
