Apex

Reputation: 1096

if rows are otherwise-identical keep the one with higher value in one field

I have a file that looks like this:

cat f1.csv:

col1,col2,col3
AK136742,BC051226,996
AK161599,Gm15417,4490
AK161599,Gm15417,6915
AK161599,Zbtb7b,1339
AK161599,Zbtb7b,1475
AK161599,Zbtb7b,1514

What I want to do is keep only one of the otherwise-duplicated rows: the one with the greatest number in col3. So if col1 and col2 are the same, keep the row that has the greater number in col3.

So the desired output should be:

col1,col2,col3
AK136742,BC051226,996
AK161599,Gm15417,6915
AK161599,Zbtb7b,1514

I used the command below but it does not solve the problem:

cat f1.csv | sort -rnk3 | awk '!x[$3]++'

Any help is appreciated - thanks!

Upvotes: 1

Views: 67

Answers (4)

choroba

Reputation: 241908

Using sort, you need

sort -t, -k3,3nr file.csv | sort -t, -su -k1,2

The first sort sorts the input numerically by the 3rd column in descending order. The second sort is stable (-s, which not all sort implementations support) and deduplicates (-u) the output by the first two columns, thus keeping the maximum for each combination.

I ignored the header line.
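
If you need to keep the header, one way (an untested sketch along the same lines) is to print it separately and run the two sorts only on the data rows:

head -n 1 file.csv; tail -n +2 file.csv | sort -t, -k3,3nr | sort -t, -su -k1,2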

Upvotes: 1

James K. Lowden

Reputation: 7837

The answers provided seem a little complicated to me. Here's an answer all in awk:

#! /usr/bin/awk -f

# remember the header so it can be printed first
NR == 1 {
    heading = $0
    next
}

# for each col1,col2 pair, keep the largest col3 seen so far
{
    key = $1 "," $2
    if( values[key] < $3 ) {
        values[key] = $3
    }
}

# print the header, then the surviving rows sorted by the first two columns
END {
    print heading
    for( k in values ) {
        print k "," values[k] | "sort -t, -k1,2"
    }
}

$ ./max.awk -F, max.dat
col1,col2,col3
AK136742,BC051226,996
AK161599,Gm15417,6915
AK161599,Zbtb7b,1514

Upvotes: 1

Ed Morton

Reputation: 203655

$ head -n 1 f1.csv; { tail -n +2 f1.csv | sort -t, -k1,2 -k3rn | awk -F, '!seen[$1,$2]++'; }
col1,col2,col3
AK136742,BC051226,996
AK161599,Gm15417,6915
AK161599,Zbtb7b,1514

or to avoid naming the input file twice (e.g. so it'll work if the input is a pipe):

$ awk '{print (NR>1) "," $0}' f1.csv | sort -t, -k1,1n -k2,3 -k4rn | cut -d',' -f2- | awk -F, '!seen[$1,$2]++'
col1,col2,col3
AK136742,BC051226,996
AK161599,Gm15417,6915
AK161599,Zbtb7b,1514

Upvotes: 1

RavinderSingh13

Reputation: 133528

With your shown samples, please try the following.

awk '
BEGIN{
  FS=OFS=","
}
{ ind = $1 FS $2 }
FNR==1{
  print
  next
}
{
  arr[ind]=(arr[ind]>$NF?arr[ind]:$NF)
}
END{
  for(i in arr){
     print i,arr[i]
  }
}
' Input_file

Explanation: a detailed explanation of the above.

awk '                        ##Starting awk program from here.
BEGIN{                       ##Starting BEGIN section of this program from here.
  FS=OFS=","                 ##Setting FS, OFS as comma here.
}
{ ind = $1 FS $2 }           ##Setting ind to the 1st and 2nd field values joined by FS.
FNR==1{                      ##Checking if it's the first line.
  print                      ##Then print it.
  next                       ##next will skip all further statements from here.
}
{
  arr[ind]=(arr[ind]>$NF?arr[ind]:$NF)  ##Creating arr with index of ind and keeping only higher value after each line comparison of last field.
}
END{                         ##Starting END block of this program from here.
  for(i in arr){             ##Starting a for loop here.
     print i,arr[i]          ##Printing index and array arr value here.
  }
}
' Input_file                 ##Mentioning Input_file name here.

Upvotes: 2
