Biopython SeqIO: how to write modified SeqRecord header

Question

I thought I'd try using Biopython to salvage some trivially corrupted fastq files provided by a collaborator. I simply need to modify header lines (starting with @) which contain a certain substring. However, the new fastq files created by the following code are unchanged. No doubt I'm missing something obvious.

What is the correct way to write a modified fastq SeqRecord?

import os, sys
from Bio import SeqIO

path_to_reads = sys.argv[1]
if not os.path.exists(path_to_reads + '/fixed'):
    os.mkdir(path_to_reads + '/fixed')

fwd_fastqs = [fn for fn in os.listdir(path_to_reads) if fn.endswith('_F.fastq')]
rev_fastqs = [fn for fn in os.listdir(path_to_reads) if fn.endswith('_R.fastq')]
fastq_pairs = zip(fwd_fastqs, rev_fastqs)

for fastq_pair in fastq_pairs:
    with open(path_to_reads + '/' +  fastq_pair[0], 'rU') as fwd_fastq:
        with open(path_to_reads + '/fixed/' +  fastq_pair[0], 'w') as fixed_fwd_fastq:
            fixed_fwd_records = []
            for fwd_record in SeqIO.parse(fwd_fastq, 'fastq'):
                fwd_record.name = fwd_record.name.replace('/2','/1')
                fixed_fwd_records.append(fwd_record)
            SeqIO.write(fixed_fwd_records, fixed_fwd_fastq, 'fastq')
    # ...

Input data (two records, header lines start with @):

@MISEQ01:115:000000000-A8FBM:1:1112:18038:15085/1
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGCCAATCATCTCGTATGCCGTCTTCTGCTTG
+
AAAAAAAAAF4CGGGGGAGGGFGHBHGHC5AGEGFFHGA3F355FGG223FFAEE0GCGA55BAB
@MISEQ01:115:000000000-A8FBM:1:1101:20590:9966/2
GATCACTCCCCTGTGAGGAACTACTGTCTTCACGCAGAAAGCGTCTAGCCATGGCGTTAGTATGA
+
1>A111DFBA1CFA1FFG1BFGB1D1DGFGH3GECA0ABFFG?E///DDGFBB0FEAEEFBDAB2

dkatzel · Accepted Answer

I'm not a python guy but I do bioinformatics so I understand the file formats. I can explain what happened and why:

Looking at the BioPython Bio.SeqIO.QualityIO fastq writer code the way that BioPython's SeqRecord object works is it has 2 fields to store parts of the defline. A name and a description. Usually one would think it would work like FASTA files and split the defline on white space with the name being the left split and the description being the optional comment in the right split. However the BioPython parser puts a copy of the defline as the description. My guess is this is a hack (along with the writer code I explain below) to get around CASAVA 1.8 reads which have spaces in them.

When the writer writes out the records, it checks to see if the name and description match and if they DON'T then it writes out the description line which is assumed to be the CASAVA 1.8 read I guess...

Since you were only changing the name portion, the match test was failing so the unchanged description was was used instead. When you blanked out the description the writer correctly uses the name field instead.

Biopython SeqIO: how to write modified SeqRecord header

Answers (2)

Related Questions