Reputation: 45
Hi everyone!
I'd like to remove duplicates (lines with the same value in the first column) from a file with 4 fields, keeping for each one the line with the highest value in the 4th column. I have to do this on a Linux server.
Before
gene subj e-value ident
g1 h1 0.05 75.5
g1 h2 0.03 60.6
g2 h7 0.00 80.5
g2 h9 0.00 50.3
g2 h4 0.03 90.7
g3 h5 0.10 30.5
g3 h8 0.00 76.8
g4 h11 0.00 80.7
After
gene subj e-value ident
g1 h1 0.05 75.5
g2 h4 0.03 90.7
g3 h8 0.00 76.8
g4 h11 0.00 80.7
Thank you so much, and I'm sorry if this has been asked before, but I couldn't find an answer to my problem.
Upvotes: 1
Views: 1664
Reputation: 2356
You can try this, if it's not a problem to get the output without the header:
tail -n +2 file.txt | sort -k1,1 -k4,4rn | sort -uk1,1
Explanation:
tail -n +2 file.txt
will remove the header so it doesn't get involved in all the sorting.
sort -k1,1 -k4,4rn
will sort by column 1 first (-k1,1) and then by column 4 numerically and in reverse order (-k4,4rn).
Finally:
sort -uk1,1
will remove duplicates, taking into account just the first column.
Be aware that -k1,1 means from column 1 to column 1, hence -k4,4 is from column 4 to column 4. Adjust to fit your columns.
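If you do want to keep the header line in the output, a minimal variation (assuming a POSIX-like shell and the same file.txt) is to print the first line separately and run the pipeline on the rest:
( head -n 1 file.txt; tail -n +2 file.txt | sort -k1,1 -k4,4rn | sort -uk1,1 )
The parentheses just group the two commands, so their combined output can be redirected to a file if needed.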
Upvotes: 5
Reputation: 236
An awk solution, though I like archimiro's version for its simplicity.
awk '
# Gene already seen: keep the new line only if its 4th field is higher.
NR > 1 && $1 in arr {
    if ($4 > arr[$1][4])
        split($0, arr[$1])   # overwrite the stored fields with this line
    next
}
# First data line for this gene (NR > 1 skips the header): store its fields.
NR > 1 {
    arr[$1][1] = ""          # make arr[$1] a subarray (GNU awk arrays of arrays)
    split($0, arr[$1])
}
# Print the stored fields of every gene, tab-separated.
END {
    for (i in arr) {
        for (j in arr[i])
            printf "%s\t", arr[i][j]
        print ""
    }
}
' data.file
The result:
g1 h1 0.05 75.5
g2 h4 0.03 90.7
g3 h8 0.00 76.8
g4 h11 0.00 80.7
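Note that the arr[$1][4] syntax above relies on GNU awk's arrays of arrays (gawk 4.0+). A rough sketch of a POSIX-awk equivalent, keeping the whole winning line per gene in hypothetical best/line arrays, could look like this:
awk '
NR > 1 && (!($1 in best) || $4 + 0 > best[$1] + 0) {
    best[$1] = $4    # highest 4th-column value seen so far for this gene
    line[$1] = $0    # the line that carries it
}
END {
    for (g in line) print line[g]
}
' data.file
As with the gawk version, the output order follows awk's unspecified array traversal order.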
Upvotes: 0
Reputation: 92904
With the GNU datamash tool:
datamash --headers -Wfs -g1 max 4 < file | cut -f1-4
Here -W uses whitespace as the field separator, -s sorts the input, -g1 groups by the first column, max 4 takes the maximum of the 4th column, --headers handles the header line, and -f prints full input lines rather than only the group keys; the final cut -f1-4 drops the extra column with the maximum value that datamash appends.
The output:
gene subj e-value ident
g1 h1 0.05 75.5
g2 h4 0.03 90.7
g3 h8 0.00 76.8
g4 h11 0.00 80.7
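datamash is not installed by default on many servers; on a Debian- or Ubuntu-based system (an assumption, adjust for your distribution's package manager) it can usually be added with:
sudo apt-get install datamash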
Upvotes: 0