austin7923
austin7923

Reputation: 95

Grouping rows based on column

I was trying to group the results below based on column 1, but was unable to do so.

Input:

rs10923724 C TBX15 intergenic
rs10923724 T WARS2 intron
rs72705210 G AMPD2 upstream
rs72705210 A GSTM4 downstream

Desired output:

rs10923724 C,T TBX15,WARS2 intergenic,intron
rs72705210 G,A AMPD2,GSTM4 upstream,downstream

Codes that I tried:

awk '{ A[$1]=A[$1]", "$2} END { for(X in A) print X"\t",substr(A[X],=2) }'

Output:

rs10923724 C,T
rs72705210 G,A

Upvotes: 0

Views: 155

Answers (1)

karakfa
karakfa

Reputation: 67507

$ awk '{k=$1; 
        for(i=2;i<=NF;i++) a[k,i]=(k in ks)?a[k,i]","$i:$i;
        ks[k]} 
   END {for(k in ks) 
          {printf "%s", k FS; 
           for(i=2;i<=NF;i++) printf "%s", a[k,i] (i==NF?ORS:FS)}}' file

rs72705210 G,A AMPD2,GSTM4 upstream,downstream
rs10923724 C,T TBX15,WARS2 intergenic,intron

group each column by the key and column index, since separator is just between elements adding the first value has special treatment. Keep track of keys separately for later retrieval. At the end, for each key print the aggregated columns; adding the right separator between fields and records based on column index.

awk arrays don't preserve the order for random keys. Sort the result if the order is important.

Upvotes: 1

Related Questions