Reputation: 333
I have a tab-delimited file with some duplicated rows. I am interested in finding rows with duplicates in the second field (position) and then keeping only the row with the highest value in the third field (quality). For example, if a file looks like below:
chrom 1677 800
chrom 1677 850
chrom 1777 900
chrom 1799 900
I am interested in retaining only a single row for each position that represents the highest quality:
chrom 1677 850
chrom 1777 900
chrom 1799 900
I can find the duplicated positions with: awk 'seen[$2]++' file.txt
But I am not sure how to compare duplicated rows from the second field based on the value in the third field. Any help would be greatly appreciated!
Thank you!
Upvotes: 2
Views: 38
Reputation: 103874
Given:
$ echo "$e"
chrom 1677 800
chrom 1677 850
chrom 1777 900
chrom 1799 900
You can do:
$ echo "$e" | awk '{if (m[$2]<$3) {m[$2]=$3; seen[$2]=$0}}
END {for (e in seen) print seen[e]}'
chrom 1677 850
chrom 1777 900
chrom 1799 900
If your file is tab separated and really a file:
awk -F $'\t' '{if (m[$2]<$3) {m[$2]=$3; seen[$2]=$0}}
END {for (e in seen) print seen[e]}' file
Upvotes: 1
Reputation: 32094
mydir$ echo $'chrom\t1677\t800\nchrom\t1677\t850\nchrom\t1779\t900\nchrom\t1777\t900' > tmp.txt
mydir$ cat tmp.txt
chrom 1677 800
chrom 1677 850
chrom 1779 900
chrom 1777 900
mydir$ sort -t$'\t' -k2,2 -k3,3nr tmp.txt | awk -v FS='\t' 'val!=$2 {val=$2;print}'
chrom 1677 850
chrom 1777 900
chrom 1779 900
Upvotes: 1