Reputation: 9
Let's assume I have a file whose structure looks like this:
AAAA 700 something1 something_else1
AAAA 98 something2 something_else2
AAAA 2000 something3 something_else3
BBBB 200 something4 something_else4
BBBB 21 something5 something_else5
BBBB 300 something6 something_else6
I need to extract, for each value in column $1, the whole line having the highest value in column $2. This means that, for the field AAAA, I would need to print the line in which $2=2000. The output should thus look like:
AAAA 2000 something3 something_else3
BBBB 300 something6 something_else6
I did it in Python, but the file is huge and the process is very time-consuming. Is there any way to do it with awk?
Upvotes: 0
Views: 59
Reputation: 67467
A combination of sort and awk is easiest:
$ sort -k1,1 -k2,2nr file | awk '!a[$1]++'
AAAA 2000 something3 something_else3
BBBB 300 something6 something_else6
Sort by the first field (to group the keys) and by the second field numerically descending, then let awk keep only the first row of each group, which is the highest by construction.
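The `!a[$1]++` pattern is a common awk idiom for keeping only the first line seen for each value of $1: `a[$1]` starts at 0 (false), `!` negates it to true on the first sighting, and `++` bumps it so later lines with the same key are skipped. A minimal standalone demonstration with made-up data:

```shell
# First occurrence of each $1 value passes, later ones are suppressed.
printf 'x 1\nx 2\ny 3\n' | awk '!a[$1]++'
# prints:
# x 1
# y 3
```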
Upvotes: 1
Reputation: 8164
You can try:
awk '
!($1 in max) || $2 > max[$1] {
    max[$1] = $2
    a[$1] = $0
}
END {
    for (i in a) {
        print a[i]
    }
}' input_file
You get (the order may differ, since it depends on awk's internal hashing of the array a):
BBBB 300 something6 something_else6
AAAA 2000 something3 something_else3
Upvotes: 1
Reputation: 203229
$ cat tst.awk
$1!=prev { if (rec!="") print rec; max=$2; rec=$0 }
$2 > max { max=$2; rec=$0 }
{ prev=$1 }
END { if (rec!="") print rec }
$ awk -f tst.awk file
AAAA 2000 something3 something_else3
BBBB 300 something6 something_else6
The above assumes the $1 values are always grouped together, as shown in your sample input. Given that, it only stores one record in memory at a time (since you say your input file is huge, that could be important), prints the records in the same order they were read, will work even for zero or negative $2 values, and will not output anything for an empty input file.
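The zero/negative point holds because `max` is seeded from each group's first $2 rather than from an implicit 0. A quick check of the same logic inline, on hypothetical data with negative and zero values:

```shell
# Same script as tst.awk, fed a group whose maximum is negative
# and a group whose maximum is zero.
printf 'AAAA -5 x\nAAAA -2 y\nBBBB 0 z\n' |
awk '$1!=prev { if (rec!="") print rec; max=$2; rec=$0 }
     $2 > max { max=$2; rec=$0 }
     { prev=$1 }
     END { if (rec!="") print rec }'
# prints:
# AAAA -2 y
# BBBB 0 z
```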
Upvotes: 3