Reputation: 129
For ease of use and compatibility with another downstream pipeline, I'm attempting to change the names of fastq sequence ids using biopython. For example... going from headers that look like this:
@D00602:32:H3LN7BCXX:1:1101:1205:2112 OP:i:1
@D00602:32:H3LN7BCXX:1:1101:1205:2112 OP:i:2
@D00602:32:H3LN7BCXX:1:1101:1182:2184 OP:i:1
@D00602:32:H3LN7BCXX:1:1101:1182:2184 OP:i:2
To headers that look like this:
@000000000000001 OP:i:1
@000000000000001 OP:i:2
@000000000000002 OP:i:1
@000000000000002 OP:i:2
I have some code, but I can't seem to get the alternating header count down (i.e. 1,1,2,2,3,3,etc.)
Any help would be appreciated. Thanks.
from Bio import SeqIO
import sys
FILE = sys.argv[1]
#Initialize numbering system at one
COUNT = 1
#Create a new dictionary for new sequence IDs
new_records=[]
for seq_record in SeqIO.parse(FILE, "fastq"):
header = '{:0>15}'.format(COUNT)
COUNT += 1
print(header)
seq_record.description =
seq_record.description.replace(seq_record.id, "")
seq_record.id = header
new_records.append(seq_record)
SeqIO.write(new_records, FILE, "fastq")
*seq_record does not contain the "OP:i:1" information
Upvotes: 3
Views: 335
Reputation: 759
Assuming you want all labels to be duplicated, all you have to do is divide count by the duplicated amount and return values rounded down, as shown below.
from Bio import SeqIO
import sys
FILE = sys.argv[1]
#Initialize numbering system at one
COUNT = 0
#Create a new dictionary for new sequence IDs
new_records=[]
for seq_record in SeqIO.parse(FILE, "fastq"):
header = '{:0>15}'.format(COUNT//2+1)
COUNT += 1
print(header)
seq_record.description =
seq_record.description.replace(seq_record.id, "")
seq_record.id = header
new_records.append(seq_record)
SeqIO.write(new_records, FILE, "fastq")
Upvotes: 4