Grouping rows based on column

Question

I was trying to group the results below based on column 1, but was unable to do so.

Input:

rs10923724 C TBX15 intergenic
rs10923724 T WARS2 intron
rs72705210 G AMPD2 upstream
rs72705210 A GSTM4 downstream

Desired output:

rs10923724 C,T TBX15,WARS2 intergenic,intron
rs72705210 G,A AMPD2,GSTM4 upstream,downstream

Code that I tried:

awk '{ A[$1]=A[$1]", "$2} END { for(X in A) print X"	",substr(A[X],2) }'

Output:

rs10923724 C,T
rs72705210 G,A

karakfa · Accepted Answer

$ awk '{k=$1; 
        for(i=2;i<=NF;i++) a[k,i]=(k in ks)?a[k,i]","$i:$i;
        ks[k]} 
   END {for(k in ks) 
          {printf "%s", k FS; 
           for(i=2;i<=NF;i++) printf "%s", a[k,i] (i==NF?ORS:FS)}}' file

rs72705210 G,A AMPD2,GSTM4 upstream,downstream
rs10923724 C,T TBX15,WARS2 intergenic,intron

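Aggregate each column under a composite index made of the key and the column number. Because the separator belongs only between elements, the first value for a key is stored as-is and every later value is appended after a comma. The keys are tracked in a separate array (ks) so they can be retrieved later. In the END block, print the aggregated columns for each key, following each field with FS, or with ORS after the last one. The END loop reuses NF, which still holds the field count of the last record read, so this assumes every row has the same number of fields.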

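awk does not guarantee any particular order when iterating with for (k in ks), which is why the keys in the output above appear in a different order from the input. Sort the result if the order is important.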
