Reputation: 321
I show example what I need to do:
INPUT:
name value1 value2 value3
john xxxxx yyyyy qqqqqq
john xxxxx ddddd vvvvvv
john mmmmm jjjjj llllll
paul xxxxx yyyyy qqqqqq
paul ccccc ccccc dddddd
and I need to keep header and split into files according same name in first column. I need to keep naming my output files according first column.
Output:
FILE1: john.tsv
name value1 value2 value3
john xxxxx yyyyy qqqqqq
john xxxxx ddddd vvvvvv
john mmmmm jjjjj llllll
FILE2: paul.tsv
name value1 value2 value3
paul xxxxx yyyyy qqqqqq
paul ccccc ccccc dddddd
INPUT and OUTPUT file is tab separate. Header is always same.
My solution is very complicated and slow:
head -1 INPUT > header
awk 'NR>1{print $1}' | sort | uniq > names
while read line
do grep $line INPUT | cat header - > $line.tsv
< names
done
Upvotes: 2
Views: 1041
Reputation: 26667
Using awk
we can write something like,
$ awk 'NR == 1{header = $0; next}
!($1 in filename){ print header > ($1".tsv") }
NR > 1 { print $0 > ($1".tsv"); filename[$1] }' file
What it does?
NR == 1{header = $0}
If number of records read is 1
, this is the header, save it in header
for later usage.
NR > 1 { print $0 > ($1".tsv"); filename[$1] }
If we have read more than one record already, print the content of the line to file name $1
, ie the first column.
filename[$1]
We are saving the filename in an associative array indexed by the filename. This array is used to print out the header.($1 in filename){ print header > ($1".tsv") }
If we cannot find the current filename in filename
array, it is the first occurrence. So we print out the header into the file.
Edit
If you want to sort the files on the second column, then we can sort them first and then pipe them to awk
like,
$ sort -n -k2 file | awk ....
-n
Numeric sort.-k2
sort on second key.This may not work, if the header is also numeric.
Upvotes: 6
Reputation: 203169
All of the answers posted so far have issues that would make them fragile and/or non-portable (e.g. use of getline without checking it's result, unparenthesized right side of output redirection, use of gawk-specific features, and not closing each output file when done with it) and/or unnecessarily complicated.
To sort your input file by the first 2 columns while retaining the header is:
$ awk -v OFS='\t' '{print (NR>1), $0}' file | sort | cut -f2-
name value1 value2 value3
john mmmmm jjjjj llllll
john xxxxx ddddd vvvvvv
john xxxxx yyyyy qqqqqq
paul ccccc ccccc dddddd
paul xxxxx yyyyy qqqqqq
and to robustly, portably, efficiently print your input including the header line to separate files named based on the first column is:
$ cat tst.awk
NR==1 { hdr=$0; next }
$1 != prev {
close(out)
out = $1 ".tsv"
print hdr > out
prev = $1
}
{ print > out }
So to put it all together would be:
awk -v OFS='\t' '{print (NR>1), $0}' file | sort | cut -f2- | awk -f tst.awk
Upvotes: 3
Reputation: 67467
similar to @nu11p01n73R's answer, added sorting of the data portion in the script
$ awk 'NR==1{h=$0; next}
!p[$1]++{print h > $1}
{print | "sort -k2 >> " $1}' file
$ head paul john
==> paul <==
name value1 value2 value3
paul ccccc ccccc dddddd
paul xxxxx yyyyy qqqqqq
==> john <==
name value1 value2 value3
john mmmmm jjjjj llllll
john xxxxx ddddd vvvvvv
john xxxxx yyyyy qqqqqq
Upvotes: 2