Paul
Paul

Reputation: 321

How to split files according first column awk

I show example what I need to do:

INPUT:

name value1 value2 value3
john xxxxx yyyyy qqqqqq
john xxxxx ddddd vvvvvv
john mmmmm jjjjj llllll
paul xxxxx yyyyy qqqqqq
paul ccccc ccccc dddddd

and I need to keep header and split into files according same name in first column. I need to keep naming my output files according first column.

Output:

FILE1: john.tsv

name value1 value2 value3
john xxxxx yyyyy qqqqqq
john xxxxx ddddd vvvvvv
john mmmmm jjjjj llllll

FILE2: paul.tsv

name value1 value2 value3
paul xxxxx yyyyy qqqqqq
paul ccccc ccccc dddddd

INPUT and OUTPUT file is tab separate. Header is always same.

My solution is very complicated and slow:

head -1 INPUT > header

awk 'NR>1{print $1}' | sort | uniq > names

while read line

 do grep $line INPUT | cat header - > $line.tsv

< names

done

Upvotes: 2

Views: 1041

Answers (3)

nu11p01n73R
nu11p01n73R

Reputation: 26667

Using awk we can write something like,

$ awk 'NR == 1{header = $0; next} 
    !($1 in filename){ print header > ($1".tsv") } 
    NR > 1 { print $0 > ($1".tsv"); filename[$1] }' file

What it does?

  • NR == 1{header = $0} If number of records read is 1, this is the header, save it in header for later usage.

  • NR > 1 { print $0 > ($1".tsv"); filename[$1] } If we have read more than one record already, print the content of the line to file name $1, ie the first column.

    • filename[$1] We are saving the filename in an associative array indexed by the filename. This array is used to print out the header.
  • ($1 in filename){ print header > ($1".tsv") } If we cannot find the current filename in filename array, it is the first occurrence. So we print out the header into the file.


Edit

If you want to sort the files on the second column, then we can sort them first and then pipe them to awk like,

$ sort -n -k2 file | awk ....
  • -n Numeric sort.
  • -k2 sort on second key.

This may not work, if the header is also numeric.

Upvotes: 6

Ed Morton
Ed Morton

Reputation: 203169

All of the answers posted so far have issues that would make them fragile and/or non-portable (e.g. use of getline without checking it's result, unparenthesized right side of output redirection, use of gawk-specific features, and not closing each output file when done with it) and/or unnecessarily complicated.

To sort your input file by the first 2 columns while retaining the header is:

$ awk -v OFS='\t' '{print (NR>1), $0}' file | sort | cut -f2-
name value1 value2 value3
john mmmmm jjjjj llllll
john xxxxx ddddd vvvvvv
john xxxxx yyyyy qqqqqq
paul ccccc ccccc dddddd
paul xxxxx yyyyy qqqqqq

and to robustly, portably, efficiently print your input including the header line to separate files named based on the first column is:

$ cat tst.awk
NR==1 { hdr=$0; next }
$1 != prev {
    close(out)
    out = $1 ".tsv"
    print hdr > out
    prev = $1
}
{ print > out }

So to put it all together would be:

awk -v OFS='\t' '{print (NR>1), $0}' file | sort | cut -f2- | awk -f tst.awk

Upvotes: 3

karakfa
karakfa

Reputation: 67467

similar to @nu11p01n73R's answer, added sorting of the data portion in the script

$ awk 'NR==1{h=$0; next}
    !p[$1]++{print h > $1} 
            {print | "sort -k2 >> " $1}' file

$ head paul john

==> paul <==
name value1 value2 value3
paul ccccc ccccc dddddd
paul xxxxx yyyyy qqqqqq

==> john <==
name value1 value2 value3
john mmmmm jjjjj llllll
john xxxxx ddddd vvvvvv
john xxxxx yyyyy qqqqqq

Upvotes: 2

Related Questions