fugu

Reputation: 6578

print if column value seen multiple times

I have a tab delimited file:

2L      31651   31752   60      -       18
2L      31660   31761   60      -       18
2L      31685   31786   60      -       18
2L      55854   55955   60      +       33
2L      67008   67109   60      -       37
2L      68606   68707   60      -       41
2L      83548   83649   60      +       56
2L      155486  155587  60      +       118
2L      169998  170099  60      -       131
2L      170000  170101  60      -       131
2L      170015  170116  60      -       131
2L      170025  170126  60      -       131
2L      170055  170156  60      -       131
2L      170062  170163  60      -       131
2L      170067  170168  60      -       131
2L      170116  170217  60      -       131
2L      327889  327990  60      -       283
2L      327908  328009  60      -       283
2L      329343  329444  60      -       284

The 6th column shows the cluster each row belongs to. I only want to keep rows from clusters that have more than 3 members. For example, the first 3 lines all belong to one cluster (cluster 18), which has only 3 members.

I'm trying awk -F "\t" '++a[$6] > 3' but it's not working as I thought it would. The expected output for the above example is one cluster with eight rows:

2L      169998  170099  60      -       131
2L      170000  170101  60      -       131
2L      170015  170116  60      -       131
2L      170025  170126  60      -       131
2L      170055  170156  60      -       131
2L      170062  170163  60      -       131
2L      170067  170168  60      -       131
2L      170116  170217  60      -       131

Any help would be appreciated

Upvotes: 0

Views: 163

Answers (2)

James Brown

Reputation: 37414

Another in awk:

$ awk '
$6==p || NR==1 {           # check if $6 hasn't changed (compare to p)
    b=b (b==""?"":ORS) $0  # gather buffer 
    p=$6                   # set p
    i++                    # counter
    next }                 # next record
{                          # $6 has changed:
    p=$6                   # set p
    if(i>3)                # if counter > 3
        print b            # output buffer
    b=$0                   # and initialize
    i=1 }                  # counter too
END {                      # in the end
    if(i>3)                # if needed
        print b }          # flush buffer
' file
2L      169998  170099  60      -       131
2L      170000  170101  60      -       131
2L      170015  170116  60      -       131
2L      170025  170126  60      -       131
2L      170055  170156  60      -       131
2L      170062  170163  60      -       131
2L      170067  170168  60      -       131
2L      170116  170217  60      -       131

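It reads the input only once, so it can also read from a pipe. Note that it compares each row's $6 to the previous row's, so it assumes the rows of a cluster are adjacent, as they are in the sample.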
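
If the rows of a cluster are not guaranteed to be adjacent, one alternative is to buffer every row and decide at the end. This is only a minimal sketch: the function name `keep_big_clusters` and the sample rows are made up for illustration, and it holds the whole input in memory.

```shell
# Hypothetical helper: keep rows whose 6th column occurs more than 3 times.
# Reads stdin, so it works at the end of a pipe; input order does not matter.
keep_big_clusters() {
    awk -F '\t' '
        { line[NR] = $0; key[NR] = $6; count[$6]++ }  # store row, its cluster, and tally
        END {
            for (i = 1; i <= NR; i++)                 # replay rows in input order
                if (count[key[i]] > 3)
                    print line[i]
        }
    '
}

# Made-up data: cluster 7 has four rows, interleaved with one row of cluster 9.
printf '2L\t1\t2\t60\t-\t7\n2L\t3\t4\t60\t+\t9\n2L\t5\t6\t60\t-\t7\n2L\t7\t8\t60\t-\t7\n2L\t9\t10\t60\t-\t7\n' |
    keep_big_clusters
```

The trade-off is memory for generality: the version above it only ever holds one cluster in its buffer, while this one holds the whole file.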

Upvotes: 1

user000001

Reputation: 33337

One approach would be to do two passes on the file:

awk 'NR==FNR{a[$6]++;next}a[$6]>3' file file

It is easy to see what happens if we add some comments:

awk ' NR == FNR { # For the lines of the first file
         a[$6]++  # increment the number of times we found word $6
         next     # skip to the next record, so the following is
      }           # executed only on the second file:
      a[$6]>3     # print the current line if the counter for word $6 is
                  # above 3
     ' file file  # input the file twice
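
For comparison, here is a small sketch of why the single-pass attempt from the question drops rows: `++a[$6] > 3` only becomes true from the fourth row of a cluster onward, because the total is not yet known when the first three rows are read. The temporary file and its contents are made up for illustration.

```shell
# Made-up sample: cluster 7 has five rows, cluster 9 has two.
sample=$(mktemp)
printf '2L\t%s\t0\t60\t-\t7\n' 1 2 3 4 5 >  "$sample"
printf '2L\t%s\t0\t60\t+\t9\n' 6 7       >> "$sample"

# Single pass: the counter exceeds 3 only from the 4th row of a cluster on,
# so the first three rows of cluster 7 are never printed.
single=$(awk -F '\t' '++a[$6] > 3' "$sample")

# Two passes: all counts are complete before any row is printed.
double=$(awk -F '\t' 'NR==FNR{a[$6]++;next} a[$6]>3' "$sample" "$sample")

printf 'single pass: %s rows\n' "$(printf '%s\n' "$single" | grep -c .)"
printf 'two passes:  %s rows\n' "$(printf '%s\n' "$double" | grep -c .)"
rm -f "$sample"
```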

Upvotes: 1
