ksw
ksw

Reputation: 333

Bash sort based on single field and compare based on another field

I have a tab-delimited file with some duplicated rows. I am interested in finding rows with duplicates in the second field (position) and then keeping only the row with the highest value in the third field (quality). For example, if a file looks like below:

chrom 1677 800
chrom 1677 850
chrom 1777 900
chrom 1799 900

I am interested in retaining only a single row for each position that represents the highest quality:

chrom 1677 850
chrom 1777 900
chrom 1799 900

I can find the duplicated positions with: awk 'seen[$2]++' file.txt

But I am not sure how to compare duplicated rows from the second field based on the value in the third field. Any help would be greatly appreciated!

Thank you!

Upvotes: 2

Views: 38

Answers (2)

dawg
dawg

Reputation: 103874

Given:

$ echo "$e"
chrom 1677 800
chrom 1677 850
chrom 1777 900
chrom 1799 900

You can do:

$ echo "$e" | awk '{if (m[$2]<$3) {m[$2]=$3; seen[$2]=$0}} 
                    END {for (e in seen) print seen[e]}'
chrom 1677 850
chrom 1777 900
chrom 1799 900

If your file is tab separated and really a file:

awk -F $'\t' '{if (m[$2]<$3) {m[$2]=$3; seen[$2]=$0}} 
              END {for (e in seen) print seen[e]}' file

Upvotes: 1

newtover
newtover

Reputation: 32094

mydir$ echo $'chrom\t1677\t800\nchrom\t1677\t850\nchrom\t1779\t900\nchrom\t1777\t900' > tmp.txt
mydir$ cat tmp.txt
chrom   1677    800
chrom   1677    850
chrom   1779    900
chrom   1777    900
mydir$ sort -t$'\t' -k2,2 -k3,3nr tmp.txt | awk -v FS='\t' 'val!=$2 {val=$2;print}'
chrom   1677    850
chrom   1777    900
chrom   1779    900

Upvotes: 1

Related Questions