susan123
susan123

Reputation: 11

how to write a fastq file from other file

I was asked to read from two files (left and right reads) Aip02.R1.fastq and Aip02.R2.fastq, and get an interleaved fasta file using zip function. The left and right files were fastq files, but when I zip them together to make a new fastq file, the writer function doesn't work anymore. It gives me error "SeqRecord (id=) has an invalid sequence."

  #!/usr/bin/env python3
  # Import Seq, SeqRecord, and SeqIO
  from Bio import SeqIO
  from Bio.SeqRecord import SeqRecord
  from Bio.Seq import Seq
  leftReads = SeqIO.parse("/scratch/AiptasiaMiSeq/fastq/Aip02.R1.fastq", "fastq")
  rightReads = SeqIO.parse("/scratch/AiptasiaMiSeq/fastq/Aip02.R2.fastq","fastq")
  A= zip(leftReads,rightReads)
  SeqIO.write(SeqRecord(list(A)), "interleave.fastq", "fastq")

Upvotes: 1

Views: 1377

Answers (1)

BioGeek
BioGeek

Reputation: 22827

Your forward and reverse sequences probably have the same ID. So use the following code to add a suffix to the IDs. I used /1 and /2 here, but things like .f and .r are also used.

from Bio import SeqIO
import itertools

def interleave(iter1, iter2) :
    for (forward, reverse) in itertools.izip(iter1, iter2):
        assert forward.id == reverse.id
        forward.id += "/1"
        reverse.id += "/2"
        yield forward
        yield reverse

leftReads = SeqIO.parse("/scratch/AiptasiaMiSeq/fastq/Aip02.R1.fastq", "fastq")
rightReads = SeqIO.parse("/scratch/AiptasiaMiSeq/fastq/Aip02.R2.fastq","fastq")

handle = open("interleave.fastq", "w")
count = SeqIO.write(interleave(leftReads, rightReads), handle, "fastq")
handle.close()
print("{} records written to interleave.fastq".format(count))

This code can become slow for large fastq files. For example see here where they report that it takes 14 minutes to create a 2GB fastq file. So they give this improved way:

from Bio.SeqIO.QualityIO import FastqGeneralIterator
import itertools

file_f = "/scratch/AiptasiaMiSeq/fastq/Aip02.R1.fastq"
file_r = "/scratch/AiptasiaMiSeq/fastq/Aip02.R2.fastq"

handle = open("interleave.fastq", "w")
count = 0

f_iter = FastqGeneralIterator(open(file_f,"rU"))
r_iter = FastqGeneralIterator(open(file_r,"rU"))
for (f_id, f_seq, f_q), (r_id, r_seq, r_q) in itertools.izip(f_iter,r_iter):
    assert f_id == r_id
    count += 2
    #Write out both reads with "/1" and "/2" suffix on ID
    handle.write("@%s/1n%sn+n%sn@%s/2n%sn+n%sn" 
                 % (f_id, f_seq, f_q, r_id, r_seq, r_q))
handle.close()
print("{} records written to interleave.fastq".format(count)

Upvotes: 1

Related Questions