ARM

Reputation: 99

Filtering a complex text file in bash

I have a text file like this:

@M00872:408:000000000-D31AB:1:1102:15653:1337 1:N:0:ATCACG
CGCGACCTCAGATCAGACGTGGCGACCCGCTGAATTTAAGCA
+
BCCBGGGGGGGGGGHHHHGGGGGGGGGGGGGGGHHHHHHHHH
@M00872:408:000000000-D31AB:1:1102:15388:1343 1:N:0:ATCACG
CGCGACCTCATGAATTTAAGGGCGACCCGCTGAATTTAAGCA
+
CBBBGGGGGGGGGGHHHHGGGGGGGGGGGGGGGHHHHHGHHH

Every 4 lines form one group, and the first line of each group starts with @. The 2nd line of each group is the one that matters to me, so I would like to filter the groups based on it: if the sequence "GATCAGACGTGGCGAC" is present in the 2nd line, I want to remove the whole group and write the remaining groups to a new file. For this example, the result would be:

@M00872:408:000000000-D31AB:1:1102:15388:1343 1:N:0:ATCACG
CGCGACCTCATGAATTTAAGGGCGACCCGCTGAATTTAAGCA
+
CBBBGGGGGGGGGGHHHHGGGGGGGGGGGGGGGHHHHHGHHH

I tried the following command, but it returns only the 2nd lines, and only those that contain the sequence. What I want is the whole group, and only when its 2nd line does not contain the sequence.

grep -i GATCAGACGTGGCGAC myfile.txt > output.txt

Do you know how to fix it?

Upvotes: 1

Views: 66

Answers (1)

anubhava

Reputation: 785971

Single awk solution:

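awk -v kw='GATCAGACGTGGCGAC' '
/^@/ && n<=0 {if (txt !~ kw) printf "%s", txt; n=4; txt=""}  # header line (not a quality line starting with @): flush previous group unless it matches
n-->0 {txt=txt $0 RS}                                         # buffer the 4 lines of the current group
END {if (txt !~ kw) printf "%s", txt}' file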

@M00872:408:000000000-D31AB:1:1102:15388:1343 1:N:0:ATCACG
CGCGACCTCATGAATTTAAGGGCGACCCGCTGAATTTAAGCA
+
CBBBGGGGGGGGGGHHHHGGGGGGGGGGGGGGGHHHHHGHHH
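
The command above tests the whole buffered group against the keyword, while the question asks only about the 2nd line of each group. A minimal sketch of that stricter check, assuming the file consists of strict 4-line records with no blank lines in between; the function name `filter_fastq` is hypothetical, and the match is case-sensitive (unlike the `grep -i` in the question):

```shell
# Hypothetical helper (name is an assumption, not part of the original answer).
# Drops every 4-line record whose 2nd line contains the keyword.
# Arguments: $1 = keyword, $2 = input file
filter_fastq() {
    awk -v kw="$1" '
        { rec[NR % 4] = $0 }                      # lines 1-3 land in rec[1..3], line 4 in rec[0]
        NR % 4 == 0 && index(rec[2], kw) == 0 {   # record complete and keyword absent from line 2
            print rec[1]; print rec[2]; print rec[3]; print rec[0]
        }
    ' "$2"
}
```

It could be called as `filter_fastq GATCAGACGTGGCGAC myfile.txt > output.txt`. Because records are located by line count rather than by a leading @, a quality line that happens to start with @ should not confuse it.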

Alternative grep + GNU awk solution:

grep -A 3 '^@' file | awk -v RS='--\n' -v ORS= '!/GATCAGACGTGGCGAC/'

@M00872:408:000000000-D31AB:1:1102:15388:1343 1:N:0:ATCACG
CGCGACCTCATGAATTTAAGGGCGACCCGCTGAATTTAAGCA
+
CBBBGGGGGGGGGGHHHHGGGGGGGGGGGGGGGHHHHHGHHH
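
One caveat with the grep pipeline: GNU grep prints the `--` separator only between non-contiguous groups of output, so when every 4-line group directly follows the previous one, the separator may never appear and awk would then see the whole file as a single record. A sketch of a variant that does not depend on that separator, assuming strict 4-line records and no tab characters in the data; the function name `filter_fastq_paste` is hypothetical:

```shell
# Hypothetical helper (name is an assumption, not part of the original answer).
# Arguments: $1 = keyword, $2 = input file
filter_fastq_paste() {
    # Join each 4-line record into one tab-separated line, keep records whose
    # 2nd field (the sequence line) lacks the keyword, then restore line breaks.
    paste - - - - < "$2" |
        awk -F '\t' -v kw="$1" 'index($2, kw) == 0' |
        tr '\t' '\n'
}
```

It could be called as `filter_fastq_paste GATCAGACGTGGCGAC myfile.txt > output.txt`.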

Upvotes: 2
