Reputation: 1655
I have tab delimited files as shown below:
CNV_chr1_12623251_12632176 8925 3 RR123 XX
CNV_chr1_13398757_13402091 3334 4 RR123 YY
CNV_chr1_13398757_13402091 3334 4 RR224 YY
CNV_chr1_14001365_14004064 2699 1 RR123 YX
CNV_chr1_14001365_14004064 2699 1 RR224 YX
Columns $1 and $2 stay identical. In this case, i would need to remove the duplicate row by indexing with the value in 4th column. and add an additional $5 with number of strings separated by comma in $4. Sample output shown below:
CNV_chr1_12623251_12632176 8925 3 RR123 1 XX
CNV_chr1_13398757_13402091 3334 4 RR123,RR124 2 YY
CNV_chr1_14001365_14004064 2699 1 RR123,RR224 2 YX
Any working soultion would be helpful.
Upvotes: 1
Views: 67
Reputation: 3065
Try this:
awk '($1 in ar){ar[$1]=ar[$1]; br[$1]=br[$1]","$4; next;}
{br[$1]=$4; $4="REPLACE_ME"; ar[$1]=$0}
END{for(key in ar){c=split(br[key],s,",")
gsub("REPLACE_ME", br[key] FS c, ar[key])
print ar[key]}}' test.txt
The output:
CNV_chr1_14001365_14004064 2699 1 RR123,RR224 2 YX
CNV_chr1_13398757_13402091 3334 4 RR123,RR224 2 YY
CNV_chr1_12623251_12632176 8925 3 RR123 1 XX
For tab-delimited input just add -F"\t"
to awk
:
awk -F"\t" '($1 in ar){ar[$1]=ar[$1]; br[$1]=br[$1]","$4; next;}
{br[$1]=$4; $4="REPLACE_ME"; ar[$1]=$0}
END{for(key in ar){c=split(br[key],s,",")
gsub("REPLACE_ME", br[key] FS c, ar[key])
print ar[key]}}' test.txt
and get:
CNV_chr1_14001365_14004064 2699 1 RR123,RR224 2 YX
CNV_chr1_13398757_13402091 3334 4 RR123,RR224 2 YY
CNV_chr1_12623251_12632176 8925 3 RR123 1 XX
Upvotes: 1