ArgBlarg101

Reputation: 15

Split CSV files by the value of a named column while retaining the header

I have a directory with many CSV files from table exports:

tblA.csv

A,B,C
1,1,1
1,2,2
2,2,2
3,3,3

tblB.csv

C,D,A
1,1,1
1,2,2
2,2,2
3,3,3

To split the files, I found this script:

awk -F, '
    NR==1 { hdr = $0; next }
    { out = "File" $1 ".csv" }
    printed[$1]++ < 1 { print hdr > out }
    { print $0 > out }
' tblA.csv

that creates 3 files, File1.csv, File2.csv and File3.csv:

==> File1.csv <==
A,B,C
1,1,1
1,2,2

==> File2.csv <==
A,B,C
2,2,2

==> File3.csv <==
A,B,C
3,3,3

Now for tblB.csv I still need to split the file by column A, but there that column is the 3rd, not the 1st.

I can't figure out a way to pass the argument A, have the script loop over the header to find the column whose name matches the argument I passed, and then split the file using that column's values. Or, if that column name does not exist, just skip the file.

Upvotes: 1

Views: 186

Answers (2)

Ed Morton

Reputation: 203597

Print a column by its name:

$ cat tst.awk
BEGIN { FS="," }
NR==1 {
    for (i=1; i<=NF; i++) {
        f[$i] = i
    }
}
{ print $(f["A"]) }

$ awk -f tst.awk tblA.csv
A
1
1
2
3

$ awk -f tst.awk tblB.csv
A
1
2
2
3

and here's how to use that idiom to do what you asked for, robustly and efficiently, using any awk in any shell on every Unix box:

Split the input file by grouped key values:

$ cat tst.awk
BEGIN { FS="," }
NR==1 {
    # map each column name in the header to its field number
    for (i=1; i<=NF; i++) {
        f[$i] = i
    }
    hdr = $0
    next
}
!(tgt in f) { exit }    # the target column doesn't exist: skip this file
{ curr = $(f[tgt]) }
curr != prev {          # the key value changed: start a new output file
    close(out)
    out = "File" curr ".csv"
    print hdr > out
    prev = curr
}
{ print > out }

$ awk -v tgt='A' -f tst.awk tblA.csv

$ head File*.csv
==> File1.csv <==
A,B,C
1,1,1
1,2,2

==> File2.csv <==
A,B,C
2,2,2

==> File3.csv <==
A,B,C
3,3,3

$ awk -v tgt='A' -f tst.awk tblB.csv

$ head File*.csv
==> File1.csv <==
C,D,A
1,1,1

==> File2.csv <==
C,D,A
1,2,2
2,2,2

==> File3.csv <==
C,D,A
3,3,3

The above assumes that the input files are grouped by the key field, as shown in your sample input; if they aren't, that can be handled in the awk script:

Split the input file by non-grouped key values using AWK only:

$ cat tblC.csv
C,D,A
2,2,3
1,2,2
1,1,3
3,3,1

$ cat tst.awk
BEGIN { FS="," }
NR==1 {
    for (i=1; i<=NF; i++) {
        f[$i] = i
    }
    hdr = $0
    next
}
{ curr = $(f[tgt]) }
curr != prev {
    close(out)
    out = "File" curr ".csv"
    if ( !doneHdr[curr]++ ) {   # only print the header the first time each file is created
        print hdr > out
    }
    prev = curr
}
{ print >> out }                # append, since a key's file may be closed and reopened

$ awk -v tgt='A' -f tst.awk tblC.csv

$ head File*.csv
==> File1.csv <==
C,D,A
3,3,1

==> File2.csv <==
C,D,A
1,2,2

==> File3.csv <==
C,D,A
2,2,3
1,1,3

but if your files are large it's more efficient to sort them first so the key values are grouped, which spares awk from repeatedly opening/closing the output files:

Split the input file by non-grouped key values using sort+AWK (more efficient for large files):

$ cat tst.sh
#!/usr/bin/env bash

tgt="$1"
shift

awk -v tgt="$tgt" '
    BEGIN { FS=","; OFS="\t" }
    NR==1 {
        for (i=1; i<=NF; i++) {
            f[$i] = i
        }
    }
    { print (NR>1), $(f[tgt]), NR, $0 }
' "${@:--}" |
sort -k1,1n -k2,2 -k3,3n |
cut -f4- |
awk -v tgt="$tgt" '
    BEGIN { FS="," }
    NR==1 {
        for (i=1; i<=NF; i++) {
            f[$i] = i
        }
        hdr = $0
        next
    }
    { curr = $(f[tgt]) }
    curr != prev {
        close(out)
        out = "File" curr ".csv"
        print hdr > out
        prev = curr
    }
    { print > out }
'

$ ./tst.sh 'A' tblC.csv

$ head File*.csv
==> File1.csv <==
C,D,A
3,3,1

==> File2.csv <==
C,D,A
1,2,2

==> File3.csv <==
C,D,A
2,2,3
1,1,3

The above works by first using awk to decorate the original input, prepending to each line:

  1. NR>1 = a header-or-not, 0-or-1 indicator so we can ensure the header always still comes first after the sort,
  2. $(f[tgt]) = the key value we want to sort on,
  3. NR = the current line number, so that for duplicate keys the output keeps the same order as the input (this field can be dropped if you use GNU sort's -s stable-sort option instead)

then we sort by those fields and remove them again using cut (that could be done in the subsequent awk script, but cut is efficient and avoids cluttering it) before the main awk script starts creating the output files.
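
For illustration, here is the decorated stream produced for tblC.csv with tgt='A', after the first awk and the sort but before the cut (the fields are tab-separated, aligned here for readability):

0   A   1   C,D,A
1   1   5   3,3,1
1   2   3   1,2,2
1   3   2   2,2,3
1   3   4   1,1,3

The 0 in field 1 pins the header line to the top, field 2 groups the key values, and field 3 preserves the original input order within each key.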

Things to beware of in such AWK scripts:

* In case you get other answers that do either of the following things, just be aware that (a sketch of both pitfalls follows this list):

  1. any solution that doesn't close() the output files will fail with "too many open files" in most awks once you get past a threshold, which may be as low as 15; even an awk that supports unlimited "open" files, such as GNU awk, will slow down past that threshold because it has to manage all those "open" files by actually opening/closing them in the OS as needed behind the scenes, and
  2. any solution that uses print > "File" $1 ".csv" or similar, without parens around the expression on the right side of the >, relies on undefined behavior and will fail with a syntax error in many awks.
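
A minimal sketch of both pitfalls and the fix (big.csv is a hypothetical input whose rows are already grouped by column 1):

# Pitfall 1: every distinct key keeps a file handle open, so most awks
# eventually abort with "too many open files":
awk -F, '{ print > ("File" $1 ".csv") }' big.csv

# Pitfall 2: an unparenthesized expression after ">" is undefined
# behavior and a syntax error in many awks:
awk -F, '{ print > "File" $1 ".csv" }' big.csv

# The fix: parenthesize (or use a variable for) the target and close()
# each file as soon as its group of rows has been written:
awk -F, '
    $1 != prev { close(out); out = "File" $1 ".csv"; prev = $1 }
    { print > out }
' big.csv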

Upvotes: 5

glenn jackman

Reputation: 246817

This is where it's handy to set awk variables on the command line, interspersed with the filenames:

awk -F, '
    FNR == 1 {hdr = $0; next}
    {out = "File" $col ".csv"}
    printed[out]++ < 1 {print hdr > out}
    {print > out}
' col=1 tblA.csv col=3 tblB.csv
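
For anyone unfamiliar with this feature: awk evaluates var=value arguments in sequence as it walks the argument list, so each assignment takes effect before the files that follow it are read (note they are not yet set in a BEGIN block). A minimal sketch, with hypothetical files a.csv and b.csv:

# x is 1 while a.csv is being read and 2 while b.csv is being read
awk '{ print FILENAME ": x=" x }' x=1 a.csv x=2 b.csv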

Upvotes: 2
