Mishal Ahmed
Mishal Ahmed

Reputation: 191

Split CSV by column value, and keep header

This has been asked many times before but I simply can't implement the solutions properly. I have a large csv named 2017-01.csv, with a date column (it's the second column in the file) and I am splitting the file by date. The original file looks like:

 date
 2017-01-01
 2017-01-01
 2017-01-01
 2017-01-02
 2017-01-02
 2017-01-02

After the split, 2017-01-01.csv looks like

2017-01-01
2017-01-01
2017-01-01

and 2017-01-02.csv looks like

2017-01-02
2017-01-02
2017-01-02

The code I am using is

awk -F ',' '{print > (""$2".csv")}' 2017.csv

Everything works fine but I need to keep the header row. So I tried

awk -F ',' 'NR==1; NR > 1{print > (""$2".csv")}' 2017-01.csv

But I still get the same results without the header row. What am I doing wrong? I read answers to many similar questions on Stackoverflow but I just can't understand what they are doing.

I want this:

2017-01-01.csv should look like

date
2017-01-01
2017-01-01
2017-01-01

2017-01-02.csv should look like

date
2017-01-02
2017-01-02
2017-01-02

Upvotes: 2

Views: 987

Answers (2)

shellter
shellter

Reputation: 37278

awk -F, '
FNR==1{hdr=$2}
 FNR > 1{
   if (! hdrPrinted[$2]){
      print hdr > (""$2".csv")
      hdrPrinted[$2]=$2
  }
  print $1, $2, $3> (""$2".csv")
}' 2017-01.csv

And as a 1-liner

awk -F, ' FNR==1{hdr=$2} FNR > 1{ if (! hdrPrinted[$2]){ print hdr > (""$2".csv"); hdrPrinted[$2]=$2; } print $1, $2, $3> (""$2".csv") }' 2017-01.csv

Produces output

cat 2017\-01\-01.csv
date
  2017-01-01
  2017-01-01
  2017-01-01

cat 2017\-01\-02.csv
date
  2017-01-02
  2017-01-02
  2017-01-02

Note that FNR means FileNumber(of)Record, so each time a new file is opened, the FNR will reset to 1. This may cause problems for specific cases of processing, but generally, I think it is the better approach, allowing you to list multiple files on the cmd line, and process them all in one process.

-----------------

Per reasonable comments below, here is more bullet-proof version which should deal with the case if more than 20 files are listed on the cmd line.

I don't have an easy way to test this, so feedback is welcome.

AND per comments below, it still needs some work, which I don't have time for right now. Look for update Saturday afternoon.

awk -F, ' FNR==1{hdr=$2}  FNR > 1{
      # length() assumes newish gawk version
      if ( length(openFiles) > 20) {
             # close the first/next file in the array
             close(openFiles[++j]".csv")
             openFiles[j]=""
      }
      if (! ($2 in openFiles) ) {
             # put the filename into the openFiles array (just once)
            openFiles[++i]=$2
            }    if (! hdrPrinted[$2]){
   print hdr > (""$2".csv")
  hdrPrinted[$2]=$2   }   print $1, $2, $3> (""$2".csv") 2017-01.csv

IHTH

Edit by Ed Morton:

awk -F, '
FNR==1 { hdr=$0; next}
{
    out = $2 ".csv"
    if (!seen[out]++) {
        print hdr > out
    }
    print >> out
    close(out)
}
' file

Upvotes: 4

AHT
AHT

Reputation: 414

The following is tested on a csv containing multiple columns with column two set to date:

awk -F',' 'prev!=$2{close(prev".csv");print "date" > ($2".csv")}{print $2 > ($2".csv");prev=$2}' Input_file

hth

Upvotes: 0

Related Questions