Angelo
Angelo

Reputation: 5059

getting unique rows based on sum of values in 2,3,4 column

I making a dictionary for my analysis in R, and this dictionary need to have unique labels.

My data looks like this

Labels  t1  t2  t3
gene1   0.000000E+00    0.000000E+00    1.138501E-01
gene2   0.000000E+00    0.000000E+00    9.550272E-02
gene3   0.000000E+00    1.851936E-02    1.019907E-01
gene4   8.212816E-02    0.000000E+00    6.570984E+00
gene5   1.282434E-01    0.000000E+00    6.240799E+00
gene6   2.918929E-01    8.453281E-01    3.387610E+00
gene7   0.000000E+00    1.923038E-01    0.000000E+00
gene7   0.000000E+00    00000000E+00    0.000000E+00
gene8   1.135057E+00    0.000000E+00    2.491100E+00
gene9   7.935625E-01    1.070320E-01    2.439292E+00
gene10  5.046790E+00    0.000000E+00    2.459273E+00
gene11  3.293614E-01    0.000000E+00    2.380152E+00
gene11  3.293614E-01    0.000000E+00    2.380152E+00
gene12  0.000000E+00    0.000000E+00    1.474757E-01
gene13  0.000000E+00    0.000000E+00    1.521591E-01
gene14  0.000000E+00    9.968809E-02    8.387166E-01
gene15  0.000000E+00    1.065761E-01    0.000000E+00

here in the table label gene7 appears two times, in such a case the values in all columns must be added and the value with the highest value should be selected and the other one should be discarded. In case of tie as in case of gene11 first one should be choosen and the others be discarded.

I have tried with sort -u k2V,2 and many combinations but it failed.

Any one liners in awk/python

kindly help

Thank you

Upvotes: 0

Views: 86

Answers (2)

Kent
Kent

Reputation: 195209

this line:

awk 'NR>1{s=$2+$3+$4;if(($1 in a && s>m[$1])|| !($1 in a)){m[$1]=s;a[$1]=$0}}END{for(x in a)print a[x]}' file|sort -V

outputs:

gene1   0.000000E+00    0.000000E+00    1.138501E-01
gene2   0.000000E+00    0.000000E+00    9.550272E-02
gene3   0.000000E+00    1.851936E-02    1.019907E-01
gene4   8.212816E-02    0.000000E+00    6.570984E+00
gene5   1.282434E-01    0.000000E+00    6.240799E+00
gene6   2.918929E-01    8.453281E-01    3.387610E+00
gene7   0.000000E+00    1.923038E-01    0.000000E+00
gene8   1.135057E+00    0.000000E+00    2.491100E+00
gene9   7.935625E-01    1.070320E-01    2.439292E+00
gene10  5.046790E+00    0.000000E+00    2.459273E+00
gene11  3.293614E-01    0.000000E+00    2.380152E+00
gene12  0.000000E+00    0.000000E+00    1.474757E-01
gene13  0.000000E+00    0.000000E+00    1.521591E-01
gene14  0.000000E+00    9.968809E-02    8.387166E-01
gene15  0.000000E+00    1.065761E-01    0.000000E+00

break it down:

awk 'NR>1{s=$2+$3+$4;
    if(($1 in a && s>m[$1])|| !($1 in a)){m[$1]=s;a[$1]=$0}
}END{for(x in a)print a[x]}' file|sort -V

Upvotes: 2

hd1
hd1

Reputation: 34677

Add a column for the sums of columns 2,3,4 using my.data <- cbind(my.data, sum=sum(my.data[,c(2:4)]) and then my.data <- my.data[order(my.data$sum),], finally, use as.vector(table(my.data)) to keep the highest numbered values only. No need for awk or python nor any other tool. Built-in R does the job just fine.

Upvotes: 2

Related Questions