Reputation: 1061
I have a file:
scaffold_0 11498
scaffold_0 11501
scaffold_0 11728 "RHOH"
scaffold_0 12144 "RHOH"
scaffold_0 20708 "RHOH"
scaffold_0 23579 "RHOH"
scaffold_0 130818
scaffold_0 200485 "NSUN7"
scaffold_0 209928 "NSUN7"
scaffold_0 212965 "NSUN7"
scaffold_0 214055 "APBB2"
scaffold_0 223404
scaffold_0 223686 "APBB2"
scaffold_0 227687 "APBB2"
scaffold_0 306105 "APBB2"
scaffold_0 307000 "APBB2"
scaffold_0 391742
scaffold_0 399332 "UCHL1"
scaffold_0 406726 "UCHL1"
scaffold_0 482215
scaffold_0 484921
scaffold_0 538855 "LIMCH1"
scaffold_0 539051 "LIMCH1"
scaffold_0 539819
scaffold_0 543347 "LIMCH1"
scaffold_0 568182 "LIMCH1"
scaffold_0 570321
scaffold_0 570325
scaffold_0 577502 "LIMCH1"
scaffold_0 578933 "LIMCH1"
scaffold_0 621330 "PHOX2B"
scaffold_0 623303 "PHOX2B"
scaffold_0 640271
scaffold_0 667510 "gene3"
scaffold_0 679096
scaffold_0 698659 "TMEM33"
scaffold_0 700427 "TMEM33"
and I want to print the lines for which the items in the 3rd column are repeated 3 or more times. So that these lines are deleted:
scaffold_0 399332 "UCHL1"
scaffold_0 406726 "UCHL1"
scaffold_0 621330 "PHOX2B"
scaffold_0 623303 "PHOX2B"
scaffold_0 667510 "gene3"
scaffold_0 698659 "TMEM33"
scaffold_0 700427 "TMEM33"
I would be happy to have the order of the file kept and the lines for which the 3rd column is empty stay. I tried:
sort -k3 file.txt | awk 'a[$3]++{ if(a[$3]>=2){ print b }; print $0}; {b=$0}'
Upvotes: 2
Views: 115
Reputation: 203684
$ awk 'NR==FNR{c[$3]++;next} c[$3]>2' file file
scaffold_0 11498
scaffold_0 11501
scaffold_0 11728 "RHOH"
scaffold_0 12144 "RHOH"
scaffold_0 20708 "RHOH"
scaffold_0 23579 "RHOH"
scaffold_0 130818
scaffold_0 200485 "NSUN7"
scaffold_0 209928 "NSUN7"
scaffold_0 212965 "NSUN7"
scaffold_0 214055 "APBB2"
scaffold_0 223404
scaffold_0 223686 "APBB2"
scaffold_0 227687 "APBB2"
scaffold_0 306105 "APBB2"
scaffold_0 307000 "APBB2"
scaffold_0 391742
scaffold_0 482215
scaffold_0 484921
scaffold_0 538855 "LIMCH1"
scaffold_0 539051 "LIMCH1"
scaffold_0 539819
scaffold_0 543347 "LIMCH1"
scaffold_0 568182 "LIMCH1"
scaffold_0 570321
scaffold_0 570325
scaffold_0 577502 "LIMCH1"
scaffold_0 578933 "LIMCH1"
scaffold_0 640271
scaffold_0 679096
.
$ awk 'NR==FNR{if ($3!="") c[$3]++;next} c[$3]>2' file file
scaffold_0 11728 "RHOH"
scaffold_0 12144 "RHOH"
scaffold_0 20708 "RHOH"
scaffold_0 23579 "RHOH"
scaffold_0 200485 "NSUN7"
scaffold_0 209928 "NSUN7"
scaffold_0 212965 "NSUN7"
scaffold_0 214055 "APBB2"
scaffold_0 223686 "APBB2"
scaffold_0 227687 "APBB2"
scaffold_0 306105 "APBB2"
scaffold_0 307000 "APBB2"
scaffold_0 538855 "LIMCH1"
scaffold_0 539051 "LIMCH1"
scaffold_0 543347 "LIMCH1"
scaffold_0 568182 "LIMCH1"
scaffold_0 577502 "LIMCH1"
scaffold_0 578933 "LIMCH1"
Upvotes: 1
Reputation: 37414
This awk reads and hashes the whole file into memory
$ awk '{
a[NR]=$0 # hash to a using record number as the key for order
c[$3]++ # $3 counter
}
END { # after file records have been hashed
for(i=1;i<=NR;i++) { # iterate in order
split(a[i],b) # get the 3rd column
if(c[b[3]]>=3) # output if count is right
print a[i]
}
}' file
Output sample:
...
scaffold_0 306105 "APBB2"
scaffold_0 307000 "APBB2"
scaffold_0 391742
scaffold_0 482215
scaffold_0 484921
scaffold_0 538855 "LIMCH1"
scaffold_0 539051 "LIMCH1"
...
Upvotes: 1