Reputation: 909
I would like to compare rows that share a value in the second column and keep, for each such group, the row with the highest values in the following columns, with priority column 3 > 4 > 5. I sorted my dataset on the second column so that equal values end up next to each other.
My dataset looks like this:
X1 A 0.38 24.68 2.93
X2 A 0.38 20.22 14.54
X3 A 0.38 20.08 00.48
X3.3 A 0.22 11.55 10.68
C43 B 0.22 11.55 20.08
C4.2 C 0.22 11.55 3.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
C44 D 0.22 1.10 1.24
P1 E 0.42 0.42 0.42
P2 E 0.42 0.42 0.42
P3 E 0.42 0.42 0.42
What I mean is: if a row's second column has the same value as another row's, then I compare their values in the third column and pick the row with the highest value there.
If the rows share both the second and the third column, then I go to the fourth column, compare the values there, and take the row with the highest value.
If rows sharing the second column still share the third and fourth columns, then I pick the row with the highest value in the fifth column.
If the second, third, fourth, and fifth columns are all the same (complete duplicates), then I print them all, but add 'duplicate' next to their fifth column.
If a row does not share its second-column value with any other row, then there is no comparison and I keep that row.
Therefore, my expected output will be:
X1 A 0.38 24.68 2.93
C43 B 0.22 11.55 20.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
P1 E 0.42 0.42 0.42duplicate
P2 E 0.42 0.42 0.42duplicate
P3 E 0.42 0.42 0.42duplicate
What I have tried fails, because I can only compare based on the second column, not on multiple columns together, and I cannot keep complete duplicates.
cat data.txt | awk -v OFS="\t" '$1=$1' | sort -k2,2 -k3nr -k4nr -k5nr | awk '!a[$2]++'
X1 A 0.38 24.68 2.93
C43 B 0.22 11.55 20.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
P1 E 0.42 0.42 0.42
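The duplicate rows are lost at the final `!a[$2]++` step, which is true only the first time a given column-2 value is seen and therefore keeps exactly one row per key. A minimal demonstration:

```shell
# !a[$2]++ is true only on the first occurrence of a column-2 value,
# so the three identical E rows collapse to one.
kept=$(printf '%s\n' 'P1 E 0.42' 'P2 E 0.42' 'P3 E 0.42' | awk '!a[$2]++')
printf '%s\n' "$kept"
```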
I would appreciate learning how to fix it.
Upvotes: 0
Views: 127
Reputation: 11
This may not be the most elegant solution, but it works:
cat data.txt | awk -v OFS="\t" '$1=$1' | sort -k2,2 -k3nr -k4nr -k5nr | awk '!a[$2]++' | cut -f2- > /tmp/fgrep.$$
cat data.txt | fgrep -f /tmp/fgrep.$$ | awk '{
    rec[NR] = $0
    idx = sprintf("%s %s %s %s", $2, $3, $4, $5)
    irec[NR] = idx
    dup[idx]++
}
END {
    for (i in rec) {
        if (dup[irec[i]] > 1) {
            print rec[i] "duplicate"
        } else {
            print rec[i]
        }
    }
}'
rm /tmp/fgrep.$$
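For reference, here is a self-contained sketch of the same approach; the `mktemp`/`trap` cleanup and the `1..NR` loop are my additions, not the original answer's (awk's `for (i in rec)` does not guarantee input order), and the sample data is recreated tab-separated because the fixed-string patterns produced by `cut -f2-` are tab-delimited and must match `data.txt` byte for byte:

```shell
# Recreate the question's sample input with tab separators.
tr ' ' '\t' > data.txt <<'EOF'
X1 A 0.38 24.68 2.93
X2 A 0.38 20.22 14.54
X3 A 0.38 20.08 00.48
X3.3 A 0.22 11.55 10.68
C43 B 0.22 11.55 20.08
C4.2 C 0.22 11.55 3.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
C44 D 0.22 1.10 1.24
P1 E 0.42 0.42 0.42
P2 E 0.42 0.42 0.42
P3 E 0.42 0.42 0.42
EOF

tmp=$(mktemp) && trap 'rm -f "$tmp"' EXIT     # no leftover file on any exit
# Winner rows per key (columns 2-5 only) become fixed-string grep patterns.
sort -k2,2 -k3nr -k4nr -k5nr data.txt | awk '!a[$2]++' | cut -f2- > "$tmp"

# grep -F is the POSIX spelling of fgrep; loop 1..NR to keep input order.
out=$(grep -F -f "$tmp" data.txt | awk '{
    rec[NR] = $0
    idx = $2 " " $3 " " $4 " " $5
    irec[NR] = idx
    dup[idx]++
}
END {
    for (i = 1; i <= NR; i++)
        print rec[i] (dup[irec[i]] > 1 ? "duplicate" : "")
}')
printf '%s\n' "$out"
```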
Upvotes: 1
Reputation: 22012
I'm afraid the code below is not sophisticated, but how about this:
awk -v OFS="\t" '$1=$1' "data.txt" | sort -k2,2 -k3nr -k4nr -k5nr > "tmp.txt"
awk -v OFS="\t" '
NR==FNR {
vals = $3","$4","$5
if (max[$2] == "") max[$2] = vals
else if (max[$2] == vals) dupe[$2] = 1
next
} {
vals = $3","$4","$5
if (dupe[$2]) $6 = "duplicate"
if (max[$2] == vals) print
}' "tmp.txt" "tmp.txt"
rm -f "tmp.txt"
The awk script processes the temporary file in two passes. The first pass records, for each key in column 2, the value combination of the first row (the largest, thanks to the sort) and sets a dupe flag if another row with the same combination is found. The second pass prints only the rows that match the recorded maximum, setting $6 to the string duplicate if the line has the dupe flag.
Upvotes: 1
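The same result can also be had without sort at all. This is my own sketch (not the answer's code): awk itself tracks the running best (column 3, 4, 5) triple per key on the first read of the file, then prints matching rows, tagging full duplicates, on the second read:

```shell
# Recreate the question's sample input (space-separated is fine here).
cat > data.txt <<'EOF'
X1 A 0.38 24.68 2.93
X2 A 0.38 20.22 14.54
X3 A 0.38 20.08 00.48
X3.3 A 0.22 11.55 10.68
C43 B 0.22 11.55 20.08
C4.2 C 0.22 11.55 3.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
C44 D 0.22 1.10 1.24
P1 E 0.42 0.42 0.42
P2 E 0.42 0.42 0.42
P3 E 0.42 0.42 0.42
EOF

out=$(awk '
function better(x3, x4, x5, y3, y4, y5) {   # numeric 3 > 4 > 5 priority
    if (x3 + 0 != y3 + 0) return x3 + 0 > y3 + 0
    if (x4 + 0 != y4 + 0) return x4 + 0 > y4 + 0
    return x5 + 0 > y5 + 0
}
NR == FNR {                  # first read: find the best triple per key
    if (!($2 in best) || better($3, $4, $5, b3[$2], b4[$2], b5[$2])) {
        best[$2] = $3 SUBSEP $4 SUBSEP $5
        b3[$2] = $3; b4[$2] = $4; b5[$2] = $5
        cnt[$2] = 1          # how many rows share the best triple
    } else if (($3 SUBSEP $4 SUBSEP $5) == best[$2]) cnt[$2]++
    next
}
($3 SUBSEP $4 SUBSEP $5) == best[$2] {   # second read: print the winners
    print $0 (cnt[$2] > 1 ? "duplicate" : "")
}' data.txt data.txt)
printf '%s\n' "$out"
```

One caveat of this variant: ties are detected by string equality of the triple, so numerically equal but differently written values (e.g. 00.48 vs 0.48) would not be treated as duplicates.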