Gunther
Gunther

Reputation: 129

Renaming interleaved fastq headers with biopython

For ease of use and compatibility with another downstream pipeline, I'm attempting to change the names of fastq sequence ids using biopython. For example... going from headers that look like this:

@D00602:32:H3LN7BCXX:1:1101:1205:2112 OP:i:1
@D00602:32:H3LN7BCXX:1:1101:1205:2112 OP:i:2
@D00602:32:H3LN7BCXX:1:1101:1182:2184 OP:i:1
@D00602:32:H3LN7BCXX:1:1101:1182:2184 OP:i:2

To headers that look like this:

@000000000000001  OP:i:1
@000000000000001  OP:i:2
@000000000000002  OP:i:1
@000000000000002  OP:i:2

I have some code, but I can't seem to get the alternating header count down (i.e. 1,1,2,2,3,3,etc.)

Any help would be appreciated. Thanks.

from Bio import SeqIO
import sys

FILE = sys.argv[1]

#Initialize numbering system at one
COUNT = 1

#Create a new dictionary for new sequence IDs
new_records=[]

for seq_record in SeqIO.parse(FILE, "fastq"):
        header = '{:0>15}'.format(COUNT)
        COUNT += 1
        print(header)
        seq_record.description = 
seq_record.description.replace(seq_record.id, "")
        seq_record.id = header
        new_records.append(seq_record)
SeqIO.write(new_records, FILE, "fastq")

*seq_record does not contain the "OP:i:1" information

Upvotes: 3

Views: 335

Answers (1)

Ghoti
Ghoti

Reputation: 759

Assuming you want all labels to be duplicated, all you have to do is divide count by the duplicated amount and return values rounded down, as shown below.

from Bio import SeqIO
import sys

FILE = sys.argv[1]

#Initialize numbering system at one
COUNT = 0

#Create a new dictionary for new sequence IDs
new_records=[]

for seq_record in SeqIO.parse(FILE, "fastq"):
        header = '{:0>15}'.format(COUNT//2+1)
        COUNT += 1
        print(header)
        seq_record.description = 
seq_record.description.replace(seq_record.id, "")
        seq_record.id = header
        new_records.append(seq_record)
SeqIO.write(new_records, FILE, "fastq")

Upvotes: 4

Related Questions