Reputation: 7
I want to select top 3 results for every line that has the same first two column.
For example the data will look like,
cat data.txt
A A 10
A A 1
A A 2
A A 5
A A 8
A B 1
A B 2
A C 6
A C 5
A C 10
A C 1
B A 1
B A 1
B A 2
B A 8
And for the result I want
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 1
B A 1
B A 2
Note that some of the "groups" do not contain 3 rows.
I have tried
sort -k1,1 -k2,2 -k3,3nr data.txt | sort -u -k1,1 -k2,2 > 1.txt
comm -23 <(sort data.txt) <(sort 1.txt)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 2.txt
comm -23 <(sort data.txt) <(cat 1.txt 2.txt | sort)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 3.txt
It seems like it's working but since I am learning to code better was wondering if there was a better way to go about this. Plus, my code will generate many files that I will have to delete.
Upvotes: 0
Views: 69
Reputation: 103844
You can do:
$ sort -k1,1 -k2,2 -k3,3nr file | awk 'a[$1,$2]++<3'
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 8
B A 2
B A 1
There are two key items to understand the awk program; associative arrays and fields.
If you reference an empty awk array element, it is an empty container -- ready for anything you put into it. You can use that as a counter.
You state If first two columns are equal...
The sort puts the file in order desired. The statement a[$1,$2]
uses the values of the first two fields as a unique entry into an associative array.
You then state ...select top 3 based on descending order of 3rd column...
Once again, the sort put the file into the desired order, and the statement a[$1,$2]++
counts them. Now just count up to three.
awk
is organized into blocks of condition {action}
The statement a[$1,$2]++<3
is true until there are more than 3 of the same pattern seen.
A wordier version of the program would be:
awk 'a[$1,$2]++<3 {print $0}'
But the default action if the condition is true is to print $0
so it is not needed.
If you are processing text in Unix, you should get to know awk
. It is the most powerful tool that POSIX guarantees you will have, and is commonly used for these tasks.
Great place to start is the online book Effective AWK Programming by Arnold D. Robbins
Upvotes: 3
Reputation: 246827
@Dawg has the best answer. This one will be a little lighter on memory, which probably won't be a concern for your data:
sort -k1,2 -k3,3nr file |
awk '
{key = $1 FS $2}
prev != key {prev = key; count = 1}
count <= 3 {print; count++}
'
Upvotes: 1
Reputation: 241898
You can sort the file by first two columns primarily and by the 3rd one numerically secondarily, then read the output and only print the first three lines for each combination of the first two columns.
sort -k1,2 -k3,3rn data.txt \
| while read c1 c2 n ; do
if [[ $c1 == $l1 && $c2 == $l2 ]] ; then
((c++))
else
c=0
fi
if (( c < 3 )) ; then
echo $c1 $c2 $n
l1=$c1
l2=$c2
fi
done
Upvotes: 0