Reputation: 99
I have a text file like this:
@M00872:408:000000000-D31AB:1:1102:15653:1337 1:N:0:ATCACG
CGCGACCTCAGATCAGACGTGGCGACCCGCTGAATTTAAGCA
+
BCCBGGGGGGGGGGHHHHGGGGGGGGGGGGGGGHHHHHHHHH
@M00872:408:000000000-D31AB:1:1102:15388:1343 1:N:0:ATCACG
CGCGACCTCATGAATTTAAGGGCGACCCGCTGAATTTAAGCA
+
CBBBGGGGGGGGGGHHHHGGGGGGGGGGGGGGGHHHHHGHHH
every 4 lines are belong one group and the first line of each group starts with @
.
the 2nd line of each group is important for me so I would like to filter out the groups based on 2nd line. in fact if this specific sequence "GATCAGACGTGGCGAC
" is present in the 2nd line, I want to remove the whole group and make a new file containing other groups.
so the result for this example is:
@M00872:408:000000000-D31AB:1:1102:15388:1343 1:N:0:ATCACG
CGCGACCTCATGAATTTAAGGGCGACCCGCTGAATTTAAGCA
+
CBBBGGGGGGGGGGHHHHGGGGGGGGGGGGGGGHHHHHGHHH
I tried the following command but it returns only the 2nd line and only the ones which contain this piece of sequence. but I want the whole group and if the 2nd line does not contain this sequnce.
grep -i GATCAGACGTGGCGAC myfile.txt > output.txt
do you know how to fix it?
Upvotes: 1
Views: 66
Reputation: 785971
Single awk
solution:
awk -v kw='GATCAGACGTGGCGAC' '/^@/{if (txt !~ kw) printf "%s", txt; n=4; txt=""} n-->0{
txt=txt $0 RS} END{if (txt !~ kw) printf "%s", txt}' file
@M00872:408:000000000-D31AB:1:1102:15388:1343 1:N:0:ATCACG
CGCGACCTCATGAATTTAAGGGCGACCCGCTGAATTTAAGCA
+
CBBBGGGGGGGGGGHHHHGGGGGGGGGGGGGGGHHHHHGHHH
Alternative grep + gnu awk
solution:
grep -A 3 '^@' file | awk -v RS='--\n' -v ORS= '!/GATCAGACGTGGCGAC/'
@M00872:408:000000000-D31AB:1:1102:15388:1343 1:N:0:ATCACG
CGCGACCTCATGAATTTAAGGGCGACCCGCTGAATTTAAGCA
+
CBBBGGGGGGGGGGHHHHGGGGGGGGGGGGGGGHHHHHGHHH
Upvotes: 2