Paul
Paul

Reputation: 696

Python print lines after context

How can I print two lines after the context i am interest in using python.

Example.fastq

@read1
AAAGGCTGTACTTCGTTCCAGTTG
+
'(''%$'))%**)2+'.(&&'/5-
@read2
CTGAGTTGAGTTAGTGTTGACTC
+
)(+-0-2145=588..,(1-,12

I can find the context of interest using...

fastq = open(Example.fastq, "r")

IDs = [read1]

with fastq as fq:
    for line in fq:
        if any(string in line for string in IDs):

Now that I have found read1 I want to print out the the following lines for read1. In bash i might use something like grep -A to do this. The desired output lines look like the following.

+
'(''%$'))%**)2+'.(&&'/5-

But in python i cant seem to find an equivalent tool. Perhaps "islice" might work but I don't see how I can get islice to start at the position of the match.

with fastq as fq:
    for line in fq:
        if any(string in line for string in IDs):
            print(list(islice(fq,3,4)))

Upvotes: 1

Views: 350

Answers (2)

BioGeek
BioGeek

Reputation: 22827

If you're handling FASTQ files, you're better off using a bioinformatics library like BioPython instead of rolling your own parser.

To get the exact result you requested, you can do:

from Bio.SeqIO.QualityIO import FastqGeneralIterator

IDs = ['read1']

with open('Example.fastq') as in_handle:
    for title, seq, qual in FastqGeneralIterator(in_handle):
        # The ID is the first word in the title line (after the @ sign):
        if title.split(None, 1)[0] in IDs:
            # Line 3 is always a '+', optionally followed by the same sequence identifier again.
            print('+') 
            print(qual)

But you can't do much with the line of quality values on its own. Your next step will be almost certainly be to convert it to Phred quality scores. But this is notoriously complicated because there are at least three different and incompatible variants of the FASTQ file format. BioPython takes care of all the edge cases for you, so that you can just do:

from Bio.SeqIO import parse

IDs = ['read1']

with open('Example.fastq') as in_handle:
    for record in parse(in_handle, 'fastq'):
        if record.id in IDs:
            print(record.letter_annotations["phred_quality"])

Upvotes: 1

TemporalWolf
TemporalWolf

Reputation: 7952

You can use next() to advance an iterator (including files):

print(next(fq))
print(next(fq))

This consumes those lines, so the for loop will continue with @read2.

if you don't want the AAA... line, you can also just consume it with next(fq). In full:

fastq = open(Example.fastq, "r")

IDs = [read1]

with fastq as fq:
    for line in fq:
        if any(string in line for string in IDs):
            next(fq)  # skip AAA line
            print(next(fq).strip())  # strip off the extra newlines
            print(next(fq).strip())

which gives

+
'(''%$'))%**)2+'.(&&'/5-

Upvotes: 3

Related Questions