P. Solar
P. Solar

Reputation: 359

Wrong plotting data (insert size vs base quality means from BAM file)

I'am using PySam to calculate the insert reads and the mean base quality for the reads of a BAM as follows:

import math
import pysam
import numpy as np
import matplotlib.pyplot as plt

bam_file = pysam.AlignmentFile("mybam.bam", "rb")

mean_base_qualities = []
insert_sizes = []
zipped = []
for read in bam_file:
    if read.is_paired and read.is_proper_pair:
        # length
        mean_bq = np.mean(read.query_qualities)
        read_length = read.query_alignment_length
        tlen = read.template_length
        insert_size = int(math.sqrt(math.pow(tlen, 2))-read_length)
        if insert_size > 0:
            zipped.append((mean_bq, insert_size))
            mean_base_qualities.append(mean_bq)
            insert_sizes.append(insert_size)


plt.plot(mean_base_qualities,insert_sizes)
plt.show()

Finally, I've got two huge lists representing the insert sizes and the mean base quality for a BAM reads, like:

insert_sizes = [291, 188, 165, 212, 186, 127, 240, 166, 120, 303, 255, ...

mean_base_qualities = [36.24324324324324, 34.73831775700935, 35.3448275862069, 30.64788732394366, 36.467289719626166, 22.673267326732674, 36.54216867469879, 31.20754716981132, 25.085714285714285, 34.65625,  ...

When I try to plot them I am getting the following graph but it is obviously wrong:

enter image description here

I also tried to sort the lists independently getting:

enter image description here

And also tried zipping the lists and different ways of plotting but with no succeed. Could you help me, please? What I am missing? Thank you in advance!

----- UPDATE -----

Accordingly to Tom Morris comment, I've made the aggregate mean of mean qualities:

import math
import pysam
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import make_interp_spline

bam_file = pysam.AlignmentFile("mybam.bam", "rb")

mean_base_qualities = []
insert_sizes = []
zipped = []

i = 0
# No longer needed to use read.query_alignment_length for substracting to TLEN value
# TLEN represents the insert size properly
for read in bam_file:
    # Only processing paired proper reads
    if read.is_paired and read.is_proper_pair and read.is_read1:
        if read.template_length != 0:
            i+=1
            mean_bq = np.mean(read.query_qualities)
            i_size = int(math.sqrt(math.pow(read.template_length, 2)))
            zipped.append((i_size, mean_bq))
            mean_base_qualities.append(mean_bq)
            insert_sizes.append(i_size)

res = [(key, statistics.mean(map(itemgetter(1), ele)))
       for key, ele in groupby(sorted(zipped, key = itemgetter(0)),
                                                key = itemgetter(0))]

tuples = zip(*res)
list1, list2 = [list(tuple) for tuple in  tuples]  

xnew = np.linspace(min(list1), max(list1), len(list1))
gfg = make_interp_spline(list1, list2, k=3)
y_new = gfg(xnew)
plt.plot(xnew, y_new)
plt.show()

But graph still does not make any sense:

enter image description here

Because there are no negatives values for mean bases :/

print(max(list1)) # Max insert size: 25381844
print(min(list1)) # Min insert size: 2
print(max(list2)) # Max mean base quality: 37.0
print(min(list2)) # Min mean base quality. 18.677419354838708

Could you help me, please??

Upvotes: 0

Views: 128

Answers (0)

Related Questions