Reputation: 11
I was asked to read from two files (left and right reads) Aip02.R1.fastq and Aip02.R2.fastq, and get an interleaved fasta file using zip function. The left and right files were fastq files, but when I zip them together to make a new fastq file, the writer function doesn't work anymore. It gives me error "SeqRecord (id=) has an invalid sequence."
#!/usr/bin/env python3
# Import Seq, SeqRecord, and SeqIO
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
leftReads = SeqIO.parse("/scratch/AiptasiaMiSeq/fastq/Aip02.R1.fastq", "fastq")
rightReads = SeqIO.parse("/scratch/AiptasiaMiSeq/fastq/Aip02.R2.fastq","fastq")
A= zip(leftReads,rightReads)
SeqIO.write(SeqRecord(list(A)), "interleave.fastq", "fastq")
Upvotes: 1
Views: 1377
Reputation: 22827
Your forward and reverse sequences probably have the same ID. So use the following code to add a suffix to the IDs. I used /1
and /2
here, but things like .f
and .r
are also used.
from Bio import SeqIO
import itertools
def interleave(iter1, iter2) :
for (forward, reverse) in itertools.izip(iter1, iter2):
assert forward.id == reverse.id
forward.id += "/1"
reverse.id += "/2"
yield forward
yield reverse
leftReads = SeqIO.parse("/scratch/AiptasiaMiSeq/fastq/Aip02.R1.fastq", "fastq")
rightReads = SeqIO.parse("/scratch/AiptasiaMiSeq/fastq/Aip02.R2.fastq","fastq")
handle = open("interleave.fastq", "w")
count = SeqIO.write(interleave(leftReads, rightReads), handle, "fastq")
handle.close()
print("{} records written to interleave.fastq".format(count))
This code can become slow for large fastq files. For example see here where they report that it takes 14 minutes to create a 2GB fastq file. So they give this improved way:
from Bio.SeqIO.QualityIO import FastqGeneralIterator
import itertools
file_f = "/scratch/AiptasiaMiSeq/fastq/Aip02.R1.fastq"
file_r = "/scratch/AiptasiaMiSeq/fastq/Aip02.R2.fastq"
handle = open("interleave.fastq", "w")
count = 0
f_iter = FastqGeneralIterator(open(file_f,"rU"))
r_iter = FastqGeneralIterator(open(file_r,"rU"))
for (f_id, f_seq, f_q), (r_id, r_seq, r_q) in itertools.izip(f_iter,r_iter):
assert f_id == r_id
count += 2
#Write out both reads with "/1" and "/2" suffix on ID
handle.write("@%s/1n%sn+n%sn@%s/2n%sn+n%sn"
% (f_id, f_seq, f_q, r_id, r_seq, r_q))
handle.close()
print("{} records written to interleave.fastq".format(count)
Upvotes: 1