csparpa

Reputation: 546

Bash CSV sorting and unique-ing

A Linux question: I have a CSV file, data.csv, with the following fields and values:

KEY,LEVEL,DATA
2.456,2,aaa
2.456,1,zzz
0.867,2,bbb
9.775,4,ddd
0.867,1,ccc
2.456,0,ttt 
...

The field KEY is a float value, while LEVEL is an integer. Values in the first field can repeat, as can values in the second, but taken together each (KEY, LEVEL) pair is unique.

What I would like to do is sort the file by the KEY column and then, for each unique value of KEY, keep only the row with the highest value of LEVEL. Sorting is not a problem:

$> sort -t, -k1,2 data.csv  # fields: KEY,LEVEL,DATA
0.867,1,ccc
0.867,2,bbb
2.456,0,ttt
2.456,1,zzz
2.456,2,aaa
9.775,4,ddd
...

But then how can I filter the rows to get what I want, which is:

0.867,2,bbb
2.456,2,aaa
9.775,4,ddd
...

Is there a way to do this with command-line tools such as sort, uniq, awk and so on? Thanks in advance.

Upvotes: 1

Views: 3030

Answers (3)

choroba

Reputation: 241728

Perl solution:

perl -aF, -ne '$h{$F[0]} = [@F[1,2]] if $F[1] > $h{$F[0]}[0]
               }{
               print join ",", $_, @{$h{$_}} for sort {$a<=>$b} keys %h' data.csv

Note that the result is different from the one you requested: the first line contains bbb, not ccc.

Upvotes: 0

Kent

Reputation: 195029

Try this line:

your sort...|awk -F, 'k&&k!=$1{print p}{p=$0;k=$1}END{print p}'

Output:

kent$  echo "0.867,1,bbb
0.867,2,ccc
2.456,0,ttt
2.456,1,zzz
2.456,2,aaa
9.775,4,ddd"|awk -F, 'k&&k!=$1{print p}{p=$0;k=$1}END{print p}'
0.867,2,ccc
2.456,2,aaa
9.775,4,ddd

The idea is that, because your input is already sorted, you can walk through it from the top: whenever the first column (KEY) changes, print the previous line, which holds the highest LEVEL for the previous KEY. The END block prints the last line, which closes the final group.

Try it with your real data; it should work.

Also, the whole logic (including your sort) could be done by awk in a single process.
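A rough sketch of that last remark, under a few assumptions: the header line is not part of the input, LEVEL is always numeric, and the array names best and level are made up for illustration. Portable awk has no built-in way to iterate an array in sorted order, so this version only moves the "keep the highest LEVEL" step into awk and still leaves the final ordering to sort, so it is not strictly a single process:

```shell
# Sample rows from the question (header omitted; that is an assumption).
data=$(mktemp)
cat > "$data" <<'EOF'
2.456,2,aaa
2.456,1,zzz
0.867,2,bbb
9.775,4,ddd
0.867,1,ccc
2.456,0,ttt
EOF

# best[k]  : the full row with the highest LEVEL seen so far for KEY k
# level[k] : that highest LEVEL, stored as a number
# The input does not need to be pre-sorted; only the output is ordered.
result=$(awk -F, '
    !($1 in best) || $2 + 0 > level[$1] { best[$1] = $0; level[$1] = $2 + 0 }
    END { for (k in best) print best[k] }
' "$data" | LC_ALL=C sort -t, -k1,1n)

printf '%s\n' "$result"
rm -f "$data"
```

The `!($1 in best)` test is there so that a KEY whose only LEVEL is 0 is still kept. LC_ALL=C makes sort treat "." as the decimal point regardless of the current locale.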

Upvotes: 4

svante

Reputation: 1385

Use:

$> sort -r data.csv | uniq -w 5 | sort

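This assumes your floats are always formatted as "0.000" to "9.999", i.e. exactly five characters wide, since uniq -w 5 compares only the first five characters of each line.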
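If the keys are not fixed-width, or LEVEL can have more than one digit, a variant along these lines may be more robust. It is only a sketch: the sample rows below are made up for illustration (they are not from the question), the header line is assumed to be absent, and seen is an arbitrary array name. It sorts numerically by KEY ascending and LEVEL descending, then uses awk instead of uniq to keep the first row of each KEY:

```shell
# Hypothetical sample with variable-width keys and two-digit levels.
data2=$(mktemp)
cat > "$data2" <<'EOF'
2.456,2,aaa
10.5,3,eee
2.456,10,zzz
0.867,2,bbb
10.5,12,fff
0.867,1,ccc
EOF

# -k1,1n  : KEY, numeric, ascending
# -k2,2nr : LEVEL, numeric, descending, so the highest LEVEL comes first per KEY
# awk keeps a row only the first time its KEY is seen.
filtered=$(LC_ALL=C sort -t, -k1,1n -k2,2nr "$data2" | awk -F, '!seen[$1]++')

printf '%s\n' "$filtered"
rm -f "$data2"
```

Note that awk compares the KEY field as a string here, so "2.456" and "2.4560" would count as different keys.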

Upvotes: 3
