Reputation: 6578
I have a tab-delimited file:
2L 31651 31752 60 - 18
2L 31660 31761 60 - 18
2L 31685 31786 60 - 18
2L 55854 55955 60 + 33
2L 67008 67109 60 - 37
2L 68606 68707 60 - 41
2L 83548 83649 60 + 56
2L 155486 155587 60 + 118
2L 169998 170099 60 - 131
2L 170000 170101 60 - 131
2L 170015 170116 60 - 131
2L 170025 170126 60 - 131
2L 170055 170156 60 - 131
2L 170062 170163 60 - 131
2L 170067 170168 60 - 131
2L 170116 170217 60 - 131
2L 327889 327990 60 - 283
2L 327908 328009 60 - 283
2L 329343 329444 60 - 284
The 6th column shows the cluster each row belongs to. I only want to keep rows whose cluster has more than 3 members. For example, the first 3 lines all belong to one cluster (cluster 18), so they should be dropped.
I'm trying awk -F "\t" '++a[$6] > 3'
but it's not working as I thought it would. The expected output for the above example is one cluster with eight rows:
2L 169998 170099 60 - 131
2L 170000 170101 60 - 131
2L 170015 170116 60 - 131
2L 170025 170126 60 - 131
2L 170055 170156 60 - 131
2L 170062 170163 60 - 131
2L 170067 170168 60 - 131
2L 170116 170217 60 - 131
Any help would be appreciated
Upvotes: 0
Views: 163
Reputation: 37414
Another in awk:
$ awk '
$6==p || NR==1 {            # check if $6 hasn't changed (compare to p)
    b=b (b==""?"":ORS) $0   # gather buffer
    p=$6                    # set p
    i++                     # counter
    next                    # next record
}
{                           # $6 has changed:
    p=$6                    # set p
    if(i>3)                 # if counter > 3
        print b             # output buffer
    b=$0                    # and initialize
    i=1                     # counter too
}
END {                       # in the end
    if(i>3)                 # if needed
        print b             # flush buffer
}' file
2L 169998 170099 60 - 131
2L 170000 170101 60 - 131
2L 170015 170116 60 - 131
2L 170025 170126 60 - 131
2L 170055 170156 60 - 131
2L 170062 170163 60 - 131
2L 170067 170168 60 - 131
2L 170116 170217 60 - 131
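It can also read from a pipe, since it makes a single pass over the input. It assumes that rows of the same cluster are adjacent, as they are in the sample.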
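To illustrate the piped usage, here is a minimal sketch that wraps a condensed version of the same buffering logic in a shell function. The function name keep_big_clusters and the six-row sample are made up for illustration; they are not from the question.

```shell
# keep_big_clusters is a hypothetical helper name; it reads from standard
# input and assumes rows of the same cluster are adjacent.
keep_big_clusters() {
  awk '
    NR > 1 && $6 != p {            # cluster id changed
      if (i > 3) print b           # flush the finished cluster if it has > 3 rows
      b = ""; i = 0                # reset buffer and counter
    }
    { b = b (b == "" ? "" : ORS) $0; p = $6; i++ }   # buffer the current row
    END { if (i > 3) print b }     # flush the last cluster if needed
  '
}

# Made-up sample: cluster 7 has 2 rows, cluster 9 has 4 rows.
printf 'a\t1\t2\t60\t+\t7\na\t3\t4\t60\t+\t7\nb\t1\t2\t60\t-\t9\nb\t2\t3\t60\t-\t9\nb\t3\t4\t60\t-\t9\nb\t4\t5\t60\t-\t9\n' |
  keep_big_clusters
```

With this sample only the four rows of cluster 9 should come out, because cluster 7 has just two members.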
Upvotes: 1
Reputation: 33337
One approach would be to do two passes on the file:
awk 'NR==FNR{a[$6]++;next}a[$6]>3' file file
It is easy to see what happens if we add some comments:
awk '
NR == FNR {    # for the lines of the first file argument (first pass)
    a[$6]++    # increment the number of times we found the value in $6
    next       # skip to the next record, so the following is
}              # executed only on the second pass:
a[$6] > 3      # print the current line if the counter for its $6 is above 3
' file file    # input the file twice
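To see why the single-pass attempt from the question falls short, here is a small side-by-side sketch. The sample data and the variable names tmp, single and double are made up for illustration.

```shell
# Made-up sample: cluster 7 has 2 rows, cluster 9 has 4 rows.
tmp=$(mktemp)
printf 'a\t1\t2\t60\t+\t7\na\t3\t4\t60\t+\t7\nb\t1\t2\t60\t-\t9\nb\t2\t3\t60\t-\t9\nb\t3\t4\t60\t-\t9\nb\t4\t5\t60\t-\t9\n' > "$tmp"

# Single pass, as in the question: a row is printed only once the running
# count for its cluster already exceeds 3, so the first 3 rows are lost.
single=$(awk '++a[$6] > 3' "$tmp")

# Two passes: the first pass only counts, the second prints rows using the
# final count for each cluster.
double=$(awk 'NR==FNR { a[$6]++; next } a[$6] > 3' "$tmp" "$tmp")

printf 'single pass:\n%s\n\ntwo passes:\n%s\n' "$single" "$double"
rm -f "$tmp"
```

On this sample the single pass prints only the fourth row of cluster 9, while the two-pass version prints all four. Unlike a buffering approach, the two-pass form needs a real file (or the data saved to one), since a pipe cannot be read twice.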
Upvotes: 1