Reputation: 696
How can I print two lines after the context i am interest in using python.
Example.fastq
@read1
AAAGGCTGTACTTCGTTCCAGTTG
+
'(''%$'))%**)2+'.(&&'/5-
@read2
CTGAGTTGAGTTAGTGTTGACTC
+
)(+-0-2145=588..,(1-,12
I can find the context of interest using...
fastq = open(Example.fastq, "r")
IDs = [read1]
with fastq as fq:
for line in fq:
if any(string in line for string in IDs):
Now that I have found read1 I want to print out the the following lines for read1. In bash i might use something like grep -A to do this. The desired output lines look like the following.
+
'(''%$'))%**)2+'.(&&'/5-
But in python i cant seem to find an equivalent tool. Perhaps "islice" might work but I don't see how I can get islice to start at the position of the match.
with fastq as fq:
for line in fq:
if any(string in line for string in IDs):
print(list(islice(fq,3,4)))
Upvotes: 1
Views: 350
Reputation: 22827
If you're handling FASTQ files, you're better off using a bioinformatics library like BioPython instead of rolling your own parser.
To get the exact result you requested, you can do:
from Bio.SeqIO.QualityIO import FastqGeneralIterator
IDs = ['read1']
with open('Example.fastq') as in_handle:
for title, seq, qual in FastqGeneralIterator(in_handle):
# The ID is the first word in the title line (after the @ sign):
if title.split(None, 1)[0] in IDs:
# Line 3 is always a '+', optionally followed by the same sequence identifier again.
print('+')
print(qual)
But you can't do much with the line of quality values on its own. Your next step will be almost certainly be to convert it to Phred quality scores. But this is notoriously complicated because there are at least three different and incompatible variants of the FASTQ file format. BioPython takes care of all the edge cases for you, so that you can just do:
from Bio.SeqIO import parse
IDs = ['read1']
with open('Example.fastq') as in_handle:
for record in parse(in_handle, 'fastq'):
if record.id in IDs:
print(record.letter_annotations["phred_quality"])
Upvotes: 1
Reputation: 7952
You can use next()
to advance an iterator (including files):
print(next(fq))
print(next(fq))
This consumes those lines, so the for
loop will continue with @read2
.
if you don't want the AAA...
line, you can also just consume it with next(fq)
. In full:
fastq = open(Example.fastq, "r")
IDs = [read1]
with fastq as fq:
for line in fq:
if any(string in line for string in IDs):
next(fq) # skip AAA line
print(next(fq).strip()) # strip off the extra newlines
print(next(fq).strip())
which gives
+
'(''%$'))%**)2+'.(&&'/5-
Upvotes: 3