Reputation: 359
I'am using PySam to calculate the insert reads and the mean base quality for the reads of a BAM as follows:
import math
import pysam
import numpy as np
import matplotlib.pyplot as plt
bam_file = pysam.AlignmentFile("mybam.bam", "rb")
mean_base_qualities = []
insert_sizes = []
zipped = []
for read in bam_file:
if read.is_paired and read.is_proper_pair:
# length
mean_bq = np.mean(read.query_qualities)
read_length = read.query_alignment_length
tlen = read.template_length
insert_size = int(math.sqrt(math.pow(tlen, 2))-read_length)
if insert_size > 0:
zipped.append((mean_bq, insert_size))
mean_base_qualities.append(mean_bq)
insert_sizes.append(insert_size)
plt.plot(mean_base_qualities,insert_sizes)
plt.show()
Finally, I've got two huge lists representing the insert sizes and the mean base quality for a BAM reads, like:
insert_sizes = [291, 188, 165, 212, 186, 127, 240, 166, 120, 303, 255, ...
mean_base_qualities = [36.24324324324324, 34.73831775700935, 35.3448275862069, 30.64788732394366, 36.467289719626166, 22.673267326732674, 36.54216867469879, 31.20754716981132, 25.085714285714285, 34.65625, ...
When I try to plot them I am getting the following graph but it is obviously wrong:
I also tried to sort the lists independently getting:
And also tried zipping the lists and different ways of plotting but with no succeed. Could you help me, please? What I am missing? Thank you in advance!
----- UPDATE -----
Accordingly to Tom Morris comment, I've made the aggregate mean of mean qualities:
import math
import pysam
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import make_interp_spline
bam_file = pysam.AlignmentFile("mybam.bam", "rb")
mean_base_qualities = []
insert_sizes = []
zipped = []
i = 0
# No longer needed to use read.query_alignment_length for substracting to TLEN value
# TLEN represents the insert size properly
for read in bam_file:
# Only processing paired proper reads
if read.is_paired and read.is_proper_pair and read.is_read1:
if read.template_length != 0:
i+=1
mean_bq = np.mean(read.query_qualities)
i_size = int(math.sqrt(math.pow(read.template_length, 2)))
zipped.append((i_size, mean_bq))
mean_base_qualities.append(mean_bq)
insert_sizes.append(i_size)
res = [(key, statistics.mean(map(itemgetter(1), ele)))
for key, ele in groupby(sorted(zipped, key = itemgetter(0)),
key = itemgetter(0))]
tuples = zip(*res)
list1, list2 = [list(tuple) for tuple in tuples]
xnew = np.linspace(min(list1), max(list1), len(list1))
gfg = make_interp_spline(list1, list2, k=3)
y_new = gfg(xnew)
plt.plot(xnew, y_new)
plt.show()
But graph still does not make any sense:
Because there are no negatives values for mean bases :/
print(max(list1)) # Max insert size: 25381844
print(min(list1)) # Min insert size: 2
print(max(list2)) # Max mean base quality: 37.0
print(min(list2)) # Min mean base quality. 18.677419354838708
Could you help me, please??
Upvotes: 0
Views: 128