chas

Reputation: 1655

How to remove duplicate rows and create an index in awk

I have tab delimited files as shown below:

CNV_chr1_12623251_12632176  8925    3   RR123   XX
CNV_chr1_13398757_13402091  3334    4   RR123   YY
CNV_chr1_13398757_13402091  3334    4   RR224   YY
CNV_chr1_14001365_14004064  2699    1   RR123   YX
CNV_chr1_14001365_14004064  2699    1   RR224   YX

Columns $1 and $2 are identical for the duplicate rows. In this case, I need to collapse the duplicates into a single row, joining the values from the 4th column with commas, and add a new 5th column holding the number of comma-separated strings in $4. Sample output shown below:

CNV_chr1_12623251_12632176  8925    3   RR123    1    XX
CNV_chr1_13398757_13402091  3334    4   RR123,RR224    2    YY
CNV_chr1_14001365_14004064  2699    1   RR123,RR224    2    YX

Any working solution would be helpful.

Upvotes: 1

Views: 67

Answers (1)

F. Knorr

Reputation: 3065

Try this:

awk '($1 in ar){br[$1]=br[$1]","$4; next}          # key seen before: append this $4 to its ID list
     {br[$1]=$4; $4="REPLACE_ME"; ar[$1]=$0}       # first occurrence: store the line with a placeholder in $4
     END{for(key in ar){c=split(br[key],s,",")     # c = number of IDs collected for this key
                        gsub("REPLACE_ME", br[key] FS c, ar[key])
                        print ar[key]}}' test.txt

The output:

CNV_chr1_14001365_14004064 2699 1 RR123,RR224 2 YX
CNV_chr1_13398757_13402091 3334 4 RR123,RR224 2 YY
CNV_chr1_12623251_12632176 8925 3 RR123 1 XX

For tab-delimited input, just add -F"\t" to the awk command:

awk -F"\t" '($1 in ar){ar[$1]=ar[$1]; br[$1]=br[$1]","$4; next;}
            {br[$1]=$4; $4="REPLACE_ME"; ar[$1]=$0}
            END{for(key in ar){c=split(br[key],s,",")
                        gsub("REPLACE_ME", br[key] FS c, ar[key])
                        print ar[key]}}' test.txt

and get:

CNV_chr1_14001365_14004064 2699 1 RR123,RR224   2 YX
CNV_chr1_13398757_13402091 3334 4 RR123,RR224   2 YY
CNV_chr1_12623251_12632176 8925 3 RR123 1 XX
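
If you also want to keep the rows in their original input order and have the whole output tab-separated, a minimal variation along these lines should work (an untested sketch; the order array and variable names are just illustrative):

awk -F"\t" -v OFS="\t" '
    !($1 in br){order[++n]=$1; br[$1]=$4; $4="REPLACE_ME"; ar[$1]=$0; next}  # first occurrence: remember key order
    {br[$1]=br[$1] "," $4}                                                   # duplicate: append this $4 to the list
    END{for(i=1;i<=n;i++){key=order[i]
                          c=split(br[key],s,",")                             # c = number of collected IDs
                          gsub("REPLACE_ME", br[key] OFS c, ar[key])
                          print ar[key]}}' test.txt

Here order[] records each key at its first appearance so the END loop can print in input order, and setting OFS to a tab makes both the rebuilt record and the inserted count column tab-delimited.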

Upvotes: 1
