pmdaly

Reputation: 1212

How to remove rows of a file whose value in the second column appears a specific number of times?

I have a file with rows of integers. I want to remove rows where the value in the second column does not appear in any other row's second column.

ex:

4 5 6 // row 0
1 2 3 // row 1
3 3 3 // row 2
4 5 4 // row 3
3 4 5 // row 4
2 2 4 // row 5

Column 2 contains the values 3 and 4 only once across all rows. I want to remove the rows with those unique column-2 values: rows 2 and 4.

I also need the result sorted by column 2, so what I currently have is

$ sort -k2 -n filename > filename_sorted

But I'm not sure how to identify the rows whose column 2 value is unique and remove them.

Upvotes: 1

Views: 110

Answers (3)

anubhava

Reputation: 785186

You can use this sort + awk:

sort -k2 -n file | awk 'p == $2{if (r != "") print r; print; r=""} p != $2{p=$2; r=$0}'
1 2 3 // row 1
2 2 4 // row 5
4 5 4 // row 3
4 5 6 // row 0
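Unpacked with comments, the same pipeline reads as follows (a sketch, using the question's sample data with the `// row N` comments stripped):

```shell
# Sample input from the question (column-2 values: 5 2 3 5 4 2)
cat > file <<'EOF'
4 5 6
1 2 3
3 3 3
4 5 4
3 4 5
2 2 4
EOF

# Sort numerically on column 2 so equal values become adjacent, then keep
# only lines whose column-2 value matches a neighbour's
sort -k2 -n file | awk '
    p == $2 {                 # same column-2 value as the previous line
        if (r != "") print r  # flush the held first line of the group
        print                 # print the current line
        r = ""                # mark the group as already printed
    }
    p != $2 {                 # a new column-2 value starts here
        p = $2                # remember it
        r = $0                # hold the line until we know it repeats
    }'
```

The held line `r` is only emitted once a second line with the same column-2 value shows up, which is what drops the singletons.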

Upvotes: 0

fedorqui

Reputation: 289725

Just process the file twice: first to count the number of times each 2nd-column value appears, and then to print the rows whose value appears more than once:

$ awk 'FNR==NR{a[$2]++; next} a[$2]>1' file file
4 5 6 // row 0
1 2 3 // row 1
4 5 4 // row 3
2 2 4 // row 5
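Since the question also wants the result ordered by column 2, the two-pass awk can simply be piped into sort (a sketch, assuming the sample input is in a file named `file`):

```shell
# Sample input from the question (comments stripped)
cat > file <<'EOF'
4 5 6
1 2 3
3 3 3
4 5 4
3 4 5
2 2 4
EOF

# Pass 1 (FNR==NR) counts column-2 values; pass 2 keeps rows whose count
# exceeds 1; sort then orders the survivors numerically by column 2
awk 'FNR==NR{a[$2]++; next} a[$2]>1' file file | sort -k2 -n
```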

Upvotes: 3

joepd

Reputation: 4841

This should do the filtering on any awk:

awk '
    {
        seen[$2]++                    # count occurrences of each column-2 value
        line[$2] = line[$2] $0 ORS    # collect every line for that value
    }
END {
    for (i in seen)
        if (seen[i] > 1)              # keep only values that appear more than once
            printf "%s", line[i]
}' file

For sorting, if you happen to use gawk 4.x, you can do it in the same pass by adding the following, which makes the `for (i in seen)` loop iterate the column-2 values in numeric order:

BEGIN {PROCINFO["sorted_in"] = "@ind_num_asc"}

Otherwise, the easiest is to pipe it to sort:

... | sort -nk2
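Put together, a variant that collects every line per column-2 value, keeps only the values that repeat, and sorts the result might look like this (a sketch on the question's sample data):

```shell
# Sample input from the question (comments stripped)
cat > file <<'EOF'
4 5 6
1 2 3
3 3 3
4 5 4
3 4 5
2 2 4
EOF

# Collect every line per column-2 value, print the groups seen more than
# once, then order the output numerically on column 2
awk '
    { seen[$2]++; line[$2] = line[$2] $0 ORS }
    END {
        for (i in seen)
            if (seen[i] > 1)
                printf "%s", line[i]
    }' file | sort -nk2
```

Appending to `line[$2]` with `ORS` keeps all lines of a group, not just the last one, so nothing is lost when a value repeats.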

Upvotes: 0
