Reputation: 95
I was trying to group the results below based on column 1, but was unable to do so.
Input:
rs10923724 C TBX15 intergenic
rs10923724 T WARS2 intron
rs72705210 G AMPD2 upstream
rs72705210 A GSTM4 downstream
Desired output:
rs10923724 C,T TBX15,WARS2 intergenic,intron
rs72705210 G,A AMPD2,GSTM4 upstream,downstream
Codes that I tried:
awk '{ A[$1]=A[$1]", "$2} END { for(X in A) print X"\t",substr(A[X],=2) }'
Output:
rs10923724 C,T
rs72705210 G,A
Upvotes: 0
Views: 155
Reputation: 67507
$ awk '{k=$1;
for(i=2;i<=NF;i++) a[k,i]=(k in ks)?a[k,i]","$i:$i;
ks[k]}
END {for(k in ks)
{printf "%s", k FS;
for(i=2;i<=NF;i++) printf "%s", a[k,i] (i==NF?ORS:FS)}}' file
rs72705210 G,A AMPD2,GSTM4 upstream,downstream
rs10923724 C,T TBX15,WARS2 intergenic,intron
group each column by the key and column index, since separator is just between elements adding the first value has special treatment. Keep track of keys separately for later retrieval. At the end, for each key print the aggregated columns; adding the right separator between fields and records based on column index.
awk
arrays don't preserve the order for random keys. Sort the result if the order is important.
Upvotes: 1