Reputation: 45
Hi everyone!
I'd like to remove duplicates (lines with the same value in the first column) from a file with 4 fields, keeping for each one the line with the highest value in the 4th column. I have to do this on a Linux server.
Before
gene subj e-value ident
g1 h1 0.05 75.5
g1 h2 0.03 60.6
g2 h7 0.00 80.5
g2 h9 0.00 50.3
g2 h4 0.03 90.7
g3 h5 0.10 30.5
g3 h8 0.00 76.8
g4 h11 0.00 80.7
After
gene subj e-value ident
g1 h1 0.05 75.5
g2 h4 0.03 90.7
g3 h8 0.00 76.8
g4 h11 0.00 80.7
Thank you so much, and I'm sorry if this has been asked before, but I couldn't find an answer to my problem.
Upvotes: 1
Views: 1664
Reputation: 2356
You can try this, if it's not a problem to get the output without the header:
tail -n +2 file.txt | sort -k1,1 -k4,4rn | sort -uk1,1
Explanation:
tail -n +2 file.txt
will remove the header so it doesn't get involved in all the sorting.
sort -k1,1 -k4,4rn
will sort by column 1 first (-k1,1) and then by column 4 numerically and in reverse order (-k4,4rn).
Finally:
sort -uk1,1
will remove duplicates, taking into account just the first column.
Be aware that -k1,1 means from column 1 to column 1, hence -k4,4 is from column 4 to column 4. Adjust to fit your columns.
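If you do want to keep the header line in the output, a minimal variation (assuming a POSIX-like shell and the same file.txt) is to print the first line separately and run the pipeline on the rest:
( head -n 1 file.txt; tail -n +2 file.txt | sort -k1,1 -k4,4rn | sort -uk1,1 )
The parentheses just group the two commands, so their combined output can be redirected to a file if needed.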
Upvotes: 5
Reputation: 236
An awk solution, though I like archimiro's version for its simplicity.
awk '
# Gene already seen: keep the new line only if its 4th field is higher.
NR > 1 && $1 in arr {
    if ($4 > arr[$1][4])
        split($0, arr[$1])   # overwrite the stored fields with this line
    next
}
# First data line for this gene (NR > 1 skips the header): store its fields.
NR > 1 {
    arr[$1][1] = ""          # make arr[$1] a subarray (GNU awk arrays of arrays)
    split($0, arr[$1])
}
# Print the stored fields of every gene, tab-separated.
END {
    for (i in arr) {
        for (j in arr[i])
            printf "%s\t", arr[i][j]
        print ""
    }
}
' data.file
The result:
g1 h1 0.05 75.5
g2 h4 0.03 90.7
g3 h8 0.00 76.8
g4 h11 0.00 80.7
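Note that the arr[$1][4] syntax above relies on GNU awk's arrays of arrays (gawk 4.0+). A rough sketch of a POSIX-awk equivalent, keeping the whole winning line per gene in hypothetical best/line arrays, could look like this:
awk '
NR > 1 && (!($1 in best) || $4 + 0 > best[$1] + 0) {
    best[$1] = $4    # highest 4th-column value seen so far for this gene
    line[$1] = $0    # the line that carries it
}
END {
    for (g in line) print line[g]
}
' data.file
As with the gawk version, the output order follows awk's unspecified array traversal order.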
Upvotes: 0
Reputation: 92904
With the GNU datamash tool:
datamash --headers -Wfs -g1 max 4 < file | cut -f1-4
Here -W uses whitespace as the field separator, -s sorts the input, -g1 groups by the first column, max 4 takes the maximum of the 4th column, --headers handles the header line, and -f prints full input lines rather than only the group keys; the final cut -f1-4 drops the extra column with the maximum value that datamash appends.
The output:
gene subj e-value ident
g1 h1 0.05 75.5
g2 h4 0.03 90.7
g3 h8 0.00 76.8
g4 h11 0.00 80.7
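datamash is not installed by default on many servers; on a Debian- or Ubuntu-based system (an assumption, adjust for your distribution's package manager) it can usually be added with:
sudo apt-get install datamash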
Upvotes: 0