Reputation: 47
I have a FASTQ file with strict formatting.
Input file:
@HWI-ST383:199:D1L73ACXX:3:1101:1309:1956 1:N:0:ACAGTGA
+
JJJHIIJFIJJJJ=BFFFFFEEEEEEDDDDDDDDDDBD
@HWI-ST383:199:D1L73ACXX:3:1101:3437:1952 1:N:0:ACAGTGA
GATCTCGAAGCAAGAGTACGACGAGTCGGGCCCCTCCA
+
IIIIFFF<?6?FAFEC@=C@1AE###############
My last question solved my problem, but I had not understood the file format correctly. From the input file I need to produce this file:
output:
@HWI-ST383:199:D1L73ACXX:3:1101:3437:1952 1:N:0:ACAGTGA
GATCTCGAAGCAAGAGTACGACGAGTCGGGCCCCTCCA
+
IIIIFFF<?6?FAFEC@=C@1AE###############
That is, I want to remove every read that does not contain a sequence.
The script below runs, but I cannot work out the regular expression that produces the result I want:
awk '/\n[GATC]*\n/' RS=+ ORS=+
After running the script I expected to get the output file shown above. The expression at this link describes the rows that I want to delete.
Upvotes: 1
Views: 1705
Reputation: 203684
It sounds like all you need is:
$ awk -v RS= '{gsub(/(^|\n)@[^\n]+\n\+\n[^\n]+\n/,"")}1' file
@HWI-ST383:199:D1L73ACXX:3:1101:3437:1952 1:N:0:ACAGTGA
GATCTCGAAGCAAGAGTACGACGAGTCGGGCCCCTCCA
+
IIIIFFF<?6?FAFEC@=C@1AE###############
i.e. just delete any string that starts with "@" ((^|\n)@), then a bunch of non-newline chars ([^\n]+), then a + between newlines (\n\+\n), then a bunch of other non-newline chars terminated with a newline ([^\n]+\n). If any lines can have leading or trailing whitespace then just throw a [[:blank:]]* in wherever the white space could occur.
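If a regexp that spans several lines feels fragile, a line-by-line version may be easier to reason about. As far as I can tell, the gsub() above also consumes the newline in front of the "@" when the empty read is not the first record, which would glue the neighbouring lines together. The sketch below is a different technique, not the command above: it holds each header until the next line has been seen, and drops the header, the "+" and the quality line when that next line is just "+". The name drop_empty_reads is made up for illustration, [ \t]* plays the role of the [[:blank:]]* mentioned above, and the whole thing assumes unwrapped FASTQ records in which only the sequence line can be missing.

```shell
# drop_empty_reads: hypothetical helper, reads FASTQ on stdin and writes
# only the reads that have a sequence line to stdout.
drop_empty_reads() {
  awk '
    skip > 0 { skip--; next }            # quality line of a dropped read
    held != "" {                         # the previous line was a header
      if ($0 ~ /^[ \t]*\+[ \t]*$/) {     # "+" directly after it: no sequence
        held = ""; skip = 1; next
      }
      print held; held = ""              # header is fine, release it
    }
    /^@/ { held = $0; next }             # hold a possible header for one line
    { print }
    END { if (held != "") print held }   # flush a line still held at EOF
  '
}
```

Usage would be something like drop_empty_reads < file > filtered.fastq, where both file names are placeholders. A quality line that happens to start with "@" is held for one line and then printed unchanged, because in a well-formed file it is never followed by a bare "+".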
Upvotes: 0
Reputation: 44043
sed '/^@H/ { N; /\n+$/ { N; d } }' filename
This works as follows:
/^@H/ {       # if the current line begins with @H
  N           # fetch the next one, append it.
  /\n+$/ {    # if the combined pattern has \n+ at the end (that is, if the new
              # line is "+")
    N         # fetch another line
    d         # and discard the lot.
  }
}
Upvotes: 2
Reputation: 174726
Using Perl:
$ perl -0777pe 's/[GATC]+\h*\n\K\+.*?[GATC]+\n//gs' file
@HWI-ST383:199:D1L73ACXX:3:1101:1309:1956 1:N:0:ACAGTGA
GATCTCGAAGCAAGAGTACGACGAGTCGGGCCCCTCCA
+
IIIIFFF<?6?FAFEC@=C@1AE###############
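But this leaves the trailing spaces untouched. If you also want to remove the trailing spaces, try the command below. Note that in both outputs the surviving header is the first read's (the one ending in 1309:1956), whereas the expected output in the question keeps the header ending in 3437:1952.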
$ perl -0777pe 's/[GATC]+\K\h*\n\+.*?[GATC]+\n/\n/gs' file
@HWI-ST383:199:D1L73ACXX:3:1101:1309:1956 1:N:0:ACAGTGA
GATCTCGAAGCAAGAGTACGACGAGTCGGGCCCCTCCA
+
IIIIFFF<?6?FAFEC@=C@1AE###############
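Whichever one-liner you end up using, it may be worth checking that the result is still well-formed FASTQ. Below is a minimal sketch, assuming unwrapped 4-line records (header, sequence, "+" line, quality); check_fastq is a hypothetical helper name, not an existing tool. It only checks structure (line prefixes, and sequence and quality of equal length), so it will not notice a header that ended up paired with another read's sequence, and trailing blanks count toward the length.

```shell
# check_fastq: hypothetical helper, reads FASTQ on stdin and returns
# non-zero (with a short message) if the 4-line structure is broken.
check_fastq() {
  awk '
    NR % 4 == 1 && !/^@/  { bad = NR; exit }   # header must start with "@"
    NR % 4 == 2           { seqlen = length($0) }
    NR % 4 == 3 && !/^\+/ { bad = NR; exit }   # separator must start with "+"
    NR % 4 == 0 && length($0) != seqlen { bad = NR; exit }
    END {
      if (bad)    { print "malformed record at line " bad; exit 1 }
      if (NR % 4) { print "input ends in the middle of a record"; exit 1 }
    }
  '
}
```

For example, check_fastq < filtered.fastq && echo OK, where filtered.fastq is a placeholder for the filtered file. Run on the original input from the question, it should report line 3, since a quality line sits where the "+" is expected.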
Upvotes: 0