Bede Constantinides
Bede Constantinides

Reputation: 2569

Biopython SeqIO: how to write modified SeqRecord header

I thought I'd try using Biopython to salvage some trivially corrupted fastq files provided by a collaborator. I simply need to modify header lines (starting with @) which contain a certain substring. However, the new fastq files created by the following code are unchanged. No doubt I'm missing something obvious.

What is the correct way to write a modified fastq SeqRecord?

import os, sys
from Bio import SeqIO

path_to_reads = sys.argv[1]
if not os.path.exists(path_to_reads + '/fixed'):
    os.mkdir(path_to_reads + '/fixed')

fwd_fastqs = [fn for fn in os.listdir(path_to_reads) if fn.endswith('_F.fastq')]
rev_fastqs = [fn for fn in os.listdir(path_to_reads) if fn.endswith('_R.fastq')]
fastq_pairs = zip(fwd_fastqs, rev_fastqs)

for fastq_pair in fastq_pairs:
    with open(path_to_reads + '/' +  fastq_pair[0], 'rU') as fwd_fastq:
        with open(path_to_reads + '/fixed/' +  fastq_pair[0], 'w') as fixed_fwd_fastq:
            fixed_fwd_records = []
            for fwd_record in SeqIO.parse(fwd_fastq, 'fastq'):
                fwd_record.name = fwd_record.name.replace('/2','/1')
                fixed_fwd_records.append(fwd_record)
            SeqIO.write(fixed_fwd_records, fixed_fwd_fastq, 'fastq')
    # ...

Input data (two records, header lines start with @):

@MISEQ01:115:000000000-A8FBM:1:1112:18038:15085/1
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGCCAATCATCTCGTATGCCGTCTTCTGCTTG
+
AAAAAAAAAF4CGGGGGAGGGFGHBHGHC5AGEGFFHGA3F355FGG223FFAEE0GCGA55BAB
@MISEQ01:115:000000000-A8FBM:1:1101:20590:9966/2
GATCACTCCCCTGTGAGGAACTACTGTCTTCACGCAGAAAGCGTCTAGCCATGGCGTTAGTATGA
+
1>A111DFBA1CFA1FFG1BFGB1D1DGFGH3GECA0ABFFG?E///DDGFBB0FEAEEFBDAB2

Upvotes: 1

Views: 1591

Answers (2)

dkatzel
dkatzel

Reputation: 31648

I'm not a python guy but I do bioinformatics so I understand the file formats. I can explain what happened and why:

Looking at the BioPython Bio.SeqIO.QualityIO fastq writer code the way that BioPython's SeqRecord object works is it has 2 fields to store parts of the defline. A name and a description. Usually one would think it would work like FASTA files and split the defline on white space with the name being the left split and the description being the optional comment in the right split. However the BioPython parser puts a copy of the defline as the description. My guess is this is a hack (along with the writer code I explain below) to get around CASAVA 1.8 reads which have spaces in them.

When the writer writes out the records, it checks to see if the name and description match and if they DON'T then it writes out the description line which is assumed to be the CASAVA 1.8 read I guess...

Since you were only changing the name portion, the match test was failing so the unchanged description was was used instead. When you blanked out the description the writer correctly uses the name field instead.

Upvotes: 2

Bede Constantinides
Bede Constantinides

Reputation: 2569

I found a solution, and I don't think it's very obvious. Read header lines can be accessed through any of SeqRecord.id, SeqRecord.name and SeqRecord.description.

No doubt there are subtle differences between them, but I've skimmed over the SeqIO docs, and they aren't explicitly mentioned. If I add fwd_record.description = '', my script replaces incidences of /1 with /2 as intended.

So, working code:

for fastq_pair in fastq_pairs:
    with open(path_to_reads + '/' +  fastq_pair[0], 'rU') as fwd_fastq:
        with open(path_to_reads + '/fixed/' +  fastq_pair[0], 'w') as fixed_fwd_fastq:
            fixed_fwd_records = []
            for fwd_record in SeqIO.parse(fwd_fastq, 'fastq'):
                fwd_record.name = fwd_record.name.replace('/2','/1')
                fwd_record.description = ''
                fixed_fwd_records.append(fwd_record)
            SeqIO.write(fixed_fwd_records, fixed_fwd_fastq, 'fastq')

Upvotes: 0

Related Questions