bapors

Reputation: 909

Compare multiple rows to pick the one with the highest value

I would like to compare rows that share the same value in the second column and keep the row with the highest values in the following columns, giving priority to column 3, then 4, then 5. I sorted my dataset on the second column so that rows with the same value are adjacent. My dataset looks like this:

X1    A 0.38 24.68  2.93
X2    A 0.38 20.22 14.54
X3    A 0.38 20.08 00.48
X3.3  A 0.22 11.55 10.68
C43   B 0.22 11.55 20.08
C4.2  C 0.22 11.55  3.08
C4.5  C 0.22 11.55 31.08
C42   D 0.96 21.15 11.24
C44   D 0.22  1.10  1.24
P1    E 0.42  0.42  0.42
P2    E 0.42  0.42  0.42
P3    E 0.42  0.42  0.42

Here, if a row has the same value in the second column as another row, I compare their values in the third column and pick the row with the highest value there.

If the rows have the same second and third columns, I move on to the fourth column, compare their values there, and keep the row with the highest value.

If the rows sharing the second column still share the values in the third and fourth columns, I pick the row with the highest value in the fifth column.

If the second, third, fourth, and fifth columns are all the same (complete duplicates), I print them all, but append 'duplicate' to their fifth column.

If a row does not share its second-column value with any other row, there is no comparison and I keep that row.

Therefore, my expected output will be:

X1    A 0.38 24.68  2.93
C43   B 0.22 11.55 20.08
C4.5  C 0.22 11.55 31.08
C42   D 0.96 21.15 11.24
P1    E 0.42  0.42  0.42duplicate
P2    E 0.42  0.42  0.42duplicate
P3    E 0.42  0.42  0.42duplicate

What I have tried so far fails: after sorting, the final awk '!a[$2]++' keeps only the first line for each second-column value, so I cannot condition on multiple columns and I cannot keep the complete duplicates.

cat data.txt | awk -v OFS="\t" '$1=$1' | sort -k2,2 -k3nr -k4nr -k5nr  | awk '!a[$2]++'
X1      A       0.38    24.68   2.93
C43     B       0.22    11.55   20.08
C4.5    C       0.22    11.55   31.08
C42     D       0.96    21.15   11.24
P1      E       0.42    0.42    0.42

I would appreciate learning how to fix it.

Upvotes: 0

Views: 127

Answers (2)

prig9951

Reputation: 11

This may not be the most elegant solution, but it works:

# keep the winning columns 2-5 of each second-column group as fixed-string patterns
awk -v OFS="\t" '$1=$1' data.txt | sort -k2,2 -k3nr -k4nr -k5nr | awk '!a[$2]++' | cut -f2- > /tmp/fgrep.$$
# normalize the data the same way so the tab-separated patterns match,
# then mark the rows whose columns 2-5 occur more than once
awk -v OFS="\t" '$1=$1' data.txt | fgrep -f /tmp/fgrep.$$ | awk '{
  rec[NR] = $0                                   # remember each line in input order
  idx = sprintf("%s %s %s %s", $2, $3, $4, $5)   # key on columns 2-5
  irec[NR] = idx
  dup[idx]++                                     # count complete duplicates per key
}
END{
  for (i = 1; i <= NR; i++) {                    # "for (i in rec)" would not preserve order
    if (dup[irec[i]] > 1)
      print rec[i] "duplicate"
    else
      print rec[i]
  }
}'
rm /tmp/fgrep.$$
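
Note that fgrep is the legacy spelling of grep -F (fixed-string matching). Also, /tmp/fgrep.$$ is named after the shell's PID, so it can collide between runs and is left behind if the script is interrupted. A minimal, safer sketch of just the temporary-file handling, assuming mktemp is available:

tmp=$(mktemp) || exit 1      # unique name instead of /tmp/fgrep.$$
trap 'rm -f "$tmp"' EXIT     # removed even on early exit or Ctrl-C
# ...then write the patterns to "$tmp" and read them back with fgrep -f "$tmp"
# exactly as above; the explicit rm at the end is no longer needed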

Upvotes: 1

tshiono

Reputation: 22012

I'm afraid the code below is not sophisticated, but how about this:

awk -v OFS="\t" '$1=$1' "data.txt" | sort -k2,2 -k3nr -k4nr -k5nr > "tmp.txt"

awk -v OFS="\t" '
    NR==FNR {
        vals = $3","$4","$5
        if (max[$2] == "") max[$2] = vals
        else if (max[$2] == vals) dupe[$2] = 1
        next
    } {
        vals = $3","$4","$5
        if (dupe[$2]) $6 = "duplicate"
        if (max[$2] == vals) print
    }' "tmp.txt" "tmp.txt"

rm -f "tmp.txt"
  • It saves the sorted result in a temporary file "tmp.txt".
  • The 2nd awk script processes the temporary file in two passes, by passing the same file twice (a minimal sketch of this idiom follows the list).
  • In the 1st pass, it records the "max value" (columns 3-5 of the first, i.e. best-sorted, line) for each second-column value.
  • It also detects duplications and sets the dupe flag when the same max value occurs again within a group.
  • In the 2nd pass, it assigns the string duplicate to the field $6 if the line's group has the dupe flag set.
  • Then it prints only the line(s) which have the max value for each second-column value.
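
For reference, the two-file trick relies on awk's NR==FNR test, which is true only while the first file argument is being read (FNR resets for each file, NR does not). A minimal sketch of the idiom on a hypothetical one-column file nums.txt (not part of the question's data):

# pass 1 (NR==FNR): find the maximum of column 1, then skip to the next line;
# pass 2: print only the lines that attain that maximum
awk 'NR==FNR { if (max == "" || $1+0 > max+0) max = $1; next } $1+0 == max+0' nums.txt nums.txt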

Upvotes: 1
