Jay
Jay

Reputation: 25

Python parsing a FastQ file - sequence and quality score trimming

I am trying to trim a sequence based upon a trimmed quality score. Since I'm relatively new to python, I was looking for something simple that may do the trick.

I have the following quality score that is converted based on phred scores to numerical values:

1816 28 32 32 26 164 30 32 1816 1618 216 22 1616 216 218 20 28 1816 1816 24 1816 1620 20 24 28 28 28 216 26 28 1622 216 28 24 18 24 28 1622 30 222 24 18 218 216 26 26 218 28 1624 24 16 26 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

Here is the corresponding sequence:

ACCGAGCCGAAGGAGACCGCATTCACCCGGATGCCCTTCGAGGCCAGCGCCACCGCCATCGACCGCGTCATCTGCTCCACCGCGGCCCAGCTGATGGAACA

Essentially, I want to be able to remove/cut/trim all value "2"s at the end of the quality score, compare the two strings and remove/cut/trim the corresponding sequence nucleotides at the end of the sequence.

I have tried splitting the quality score into two sections, and using that to print the first section of the sequence:

cutquality = actualquality.split(" 2 ",1)
newquality = cutquality[0]
lenofseq = len(cutquality[0].strip("  "))
newseq = actualseq[:lenofseq]

But it doesn't seem to print the cut sequence. I am able to cut the quality score where I want it to. Another thing you may have noticed is that the quality score values are not spaced appropriately. I'm not sure why this is occurring.. I used a " ".join() for the quality score values once they were converted and this is the output I got.

I would appreciate your suggestions!

Upvotes: 0

Views: 1724

Answers (2)

mhawke
mhawke

Reputation: 87074

To handle trimming from the end only of the quality scores you can use itertools.dropwhile() on the reversed quality_scores list to get rid of the trailing '2' items (thanks @cdlane for that idea). Then slice the bases to the length of the trimmed quality scores. No zipping required:

from itertools import dropwhile

quality_scores = '1816 28 32 32 26 164 30 32 1816 1618 216 22 1616 216 218 20 28 1816 1816 24 1816 1620 20 24 28 28 28 216 26 28 1622 216 28 24 18 24 28 1622 30 222 24 18 218 216 26 26 218 28 1624 24 16 26 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2'.split()
bases = 'ACCGAGCCGAAGGAGACCGCATTCACCCGGATGCCCTTCGAGGCCAGCGCCACCGCCATCGACCGCGTCATCTGCTCCACCGCGGCCCAGCTGATGGAACA'

length = len(list(dropwhile(lambda x: x == '2', reversed(quality_scores))))
newseq = bases[:length]
print(newseq)

Output

ACCGAGCCGAAGGAGACCGCATTCACCCGGATGCCCTTCGAGGCCAGCGCCA

Upvotes: 0

cdlane
cdlane

Reputation: 41872

dropwhile() from itertools, when applied to the reversed data, does what you want -- it filters until the filter is no longer true and then drops out completely, low quality scores will no longer trip it:

from itertools import dropwhile

qualities = '1816 28 32 32 26 164 30 32 1816 1618 216 22 1616 216 218 20 28 1816 1816 24 1816 1620 20 24 28 28 28 216 26 28 1622 216 28 24 18 24 28 1622 30 222 24 18 218 216 26 26 218 28 1624 24 16 26 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2'.split()

sequence = 'ACCGAGCCGAAGGAGACCGCATTCACCCGGATGCCCTTCGAGGCCAGCGCCACCGCCATCGACCGCGTCATCTGCTCCACCGCGGCCCAGCTGATGGAACA'

QUALITY_LIMIT = 2

QUALITY_SCORE, BASE = 0, 1

filter = lambda pair: pair[QUALITY_SCORE] <= QUALITY_LIMIT

reversed_quality_sequence = zip(reversed([int(quality) for quality in qualities]), reversed(sequence))

filtered_sequence = "".join(reversed([pair[BASE] for pair in dropwhile(filter, reversed_quality_sequence)]))

print(filtered_sequence)

OUTPUT

ACCGAGCCGAAGGAGACCGCATTCACCCGGATGCCCTTCGAGGCCAGCGCCA

This same technique can also be used (more easily) to clean up low qualty data at the start of the sequence.

Upvotes: 1

Related Questions