Reputation: 191
This has been asked many times before but I simply can't implement the solutions properly. I have a large csv named 2017-01.csv, with a date column (it's the second column in the file) and I am splitting the file by date. The original file looks like:
date
2017-01-01
2017-01-01
2017-01-01
2017-01-02
2017-01-02
2017-01-02
After the split, 2017-01-01.csv looks like
2017-01-01
2017-01-01
2017-01-01
and 2017-01-02.csv looks like
2017-01-02
2017-01-02
2017-01-02
The code I am using is
awk -F ',' '{print > (""$2".csv")}' 2017.csv
Everything works fine but I need to keep the header row. So I tried
awk -F ',' 'NR==1; NR > 1{print > (""$2".csv")}' 2017-01.csv
But I still get the same results without the header row. What am I doing wrong? I read answers to many similar questions on Stackoverflow but I just can't understand what they are doing.
I want this:
2017-01-01.csv should look like
date
2017-01-01
2017-01-01
2017-01-01
2017-01-02.csv should look like
date
2017-01-02
2017-01-02
2017-01-02
Upvotes: 2
Views: 987
Reputation: 37278
awk -F, '
FNR==1{hdr=$2}
FNR > 1{
if (! hdrPrinted[$2]){
print hdr > (""$2".csv")
hdrPrinted[$2]=$2
}
print $1, $2, $3> (""$2".csv")
}' 2017-01.csv
And as a 1-liner
awk -F, ' FNR==1{hdr=$2} FNR > 1{ if (! hdrPrinted[$2]){ print hdr > (""$2".csv"); hdrPrinted[$2]=$2; } print $1, $2, $3> (""$2".csv") }' 2017-01.csv
Produces output
cat 2017\-01\-01.csv
date
2017-01-01
2017-01-01
2017-01-01
cat 2017\-01\-02.csv
date
2017-01-02
2017-01-02
2017-01-02
Note that FNR
means FileNumber(of)Record, so each time a new file is opened, the FNR will reset to 1. This may cause problems for specific cases of processing, but generally, I think it is the better approach, allowing you to list multiple files on the cmd line, and process them all in one process.
Per reasonable comments below, here is more bullet-proof version which should deal with the case if more than 20 files are listed on the cmd line.
I don't have an easy way to test this, so feedback is welcome.
AND per comments below, it still needs some work, which I don't have time for right now. Look for update Saturday afternoon.
awk -F, ' FNR==1{hdr=$2} FNR > 1{
# length() assumes newish gawk version
if ( length(openFiles) > 20) {
# close the first/next file in the array
close(openFiles[++j]".csv")
openFiles[j]=""
}
if (! ($2 in openFiles) ) {
# put the filename into the openFiles array (just once)
openFiles[++i]=$2
} if (! hdrPrinted[$2]){
print hdr > (""$2".csv")
hdrPrinted[$2]=$2 } print $1, $2, $3> (""$2".csv") 2017-01.csv
IHTH
Edit by Ed Morton:
awk -F, '
FNR==1 { hdr=$0; next}
{
out = $2 ".csv"
if (!seen[out]++) {
print hdr > out
}
print >> out
close(out)
}
' file
Upvotes: 4
Reputation: 414
The following is tested on a csv containing multiple columns with column two set to date:
awk -F',' 'prev!=$2{close(prev".csv");print "date" > ($2".csv")}{print $2 > ($2".csv");prev=$2}' Input_file
hth
Upvotes: 0