Reputation: 7

If first two columns are equal, select top 3 based on descending order of 3rd column

I want to select top 3 results for every line that has the same first two column.

For example the data will look like,

cat data.txt
A    A    10
A    A    1
A    A    2
A    A    5
A    A    8
A    B    1
A    B    2
A    C    6
A    C    5
A    C    10
A    C    1
B    A    1
B    A    1
B    A    2
B    A    8

And for the result I want

A    A    10
A    A    8
A    A    5
A    B    2
A    B    1
A    C    10
A    C    6
A    C    5
B    A    1
B    A    1
B    A    2

Note that some of the "groups" do not contain 3 rows.

I have tried

sort -k1,1 -k2,2 -k3,3nr data.txt | sort -u -k1,1 -k2,2 > 1.txt 
comm -23 <(sort data.txt) <(sort 1.txt)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 2.txt 
comm -23 <(sort data.txt) <(cat 1.txt 2.txt | sort)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 3.txt

It seems like it's working but since I am learning to code better was wondering if there was a better way to go about this. Plus, my code will generate many files that I will have to delete.

Upvotes: 0

Answers (3)

dawg

Reputation: 103844

You can do:

$ sort -k1,1 -k2,2 -k3,3nr file | awk 'a[$1,$2]++<3'
A    A    10
A    A    8
A    A    5
A    B    2
A    B    1
A    C    10
A    C    6
A    C    5
B    A    8
B    A    2
B    A    1

Explanation:

There are two key items to understand the awk program; associative arrays and fields.

If you reference an empty awk array element, it is an empty container -- ready for anything you put into it. You can use that as a counter.

You state If first two columns are equal...

The sort puts the file in order desired. The statement a[$1,$2] uses the values of the first two fields as a unique entry into an associative array.

You then state ...select top 3 based on descending order of 3rd column...

Once again, the sort put the file into the desired order, and the statement a[$1,$2]++ counts them. Now just count up to three.

awk is organized into blocks of condition {action} The statement a[$1,$2]++<3 is true until there are more than 3 of the same pattern seen.

A wordier version of the program would be:

awk 'a[$1,$2]++<3 {print $0}'

But the default action if the condition is true is to print $0 so it is not needed.

If you are processing text in Unix, you should get to know awk. It is the most powerful tool that POSIX guarantees you will have, and is commonly used for these tasks.

Great place to start is the online book Effective AWK Programming by Arnold D. Robbins

Upvotes: 3

glenn jackman

Reputation: 246827

@Dawg has the best answer. This one will be a little lighter on memory, which probably won't be a concern for your data:

sort -k1,2 -k3,3nr file | 
awk '
    {key = $1 FS $2} 
    prev != key {prev = key; count = 1} 
    count <= 3 {print; count++}
'

Upvotes: 1

choroba

Reputation: 241898

You can sort the file by first two columns primarily and by the 3rd one numerically secondarily, then read the output and only print the first three lines for each combination of the first two columns.

sort -k1,2 -k3,3rn data.txt \
| while read c1 c2 n ; do
    if [[ $c1 == $l1 && $c2 == $l2 ]] ; then
        ((c++))
    else
        c=0
    fi
    if (( c < 3 )) ; then
        echo $c1 $c2 $n
        l1=$c1
        l2=$c2
    fi
done

Upvotes: 0

If first two columns are equal, select top 3 based on descending order of 3rd column

Answers (3)

Explanation:

Related Questions