Reputation: 81
I have a fasta file that reads like so:
>00009c1cc42953fb4702f6331325c7cc
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGTTGTTAAGTCAGTGGTGAAATCGTGTGGCTCAACCATACGGAGCCATTGAAACTGGCGACCTTGAGTGTAAACGAGGTAGGCGGAATGTGACGTGTAGCGGTGAAATGCTTAGATATGTCACAGAACCCCGATTGCGAAGGCAGCTTACCAGCATACAACTGAC
>000118a5e731455e942c61a82a40367a623088d0
AGAGTTTTATCCTGGCTCAGGATGAACGCTAGCGGCAGGCCTAATACATGCAAGTCGGACGGGATCTAAATTTAAGCTTGCTTAAGTTTAGTGAGAGTGGCGCACGGGTGCGTAACGCGTGAGCAACCTACCCATATCAGGGGGATAGCCCGAAGAAATTCGGATTAACACCGCATAACACAGCAATCTCGCATGAGATCACTGTTAAATATTTATAGGATATGGATGGGCTCGCGTGACATTAGCTAGTTGGTAAGGTAACGGCTTACCAAGGCAACGATGTCTAGGGGCTCTGAGAGGAGAATCCCCCACACTGGTACTGAGACACGGACCAGACTCCTACGGGAGGCAGCAGTAAGGATTATTGGTCAATGGAGGGAACTCTGAACCAGCCATGCCGCGTGCAGGATGACTGCCCTATGGGTTGTAAACTGCTTTTGTCTGGGAATAAACCTTGATTCGTGAATCAAGCTGAATGTACCAGAAGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCTTTATAAGTCAGAGGTGAAAGACGGCAGCTTAACTGTCGCAGTGCCTTTGATACTGTATAGCTTGAATATCGTTGAAGATGGCGGAATGAGACAAGTAGCGGTGAAATGCATAGATATGTCTCAGAACTCCGATTGCGAAGGCAGCTGTCTAAGCGGCAATTGACGCTGATGCACGAAAGCGTGGGGATCAAACAGGATTAGATACCCTGGTAGTCCACGCCCTAAACGATGATAACTGGATGTTGGCGATACACAGTCAGCGTCTTAGCGAAAGCGTTAAGTTATCCACCTGGGGAGTACGCCCGCAAGGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGAGGAGCATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCCGGGCTTGAAAGTTAGTGAATGCGACAGAGACGTCTCAGTCCTTCGGGACACGAAACTAGGTGCTGCATGGCTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATGTTTAGTTGCCAGCATGTAATGATGGGGACTCTAAACAGACTGCCTGCGTAAGCAGCGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGTCCGGGGCTACACACGTGCTACAATGGATGGTACAGCGGGCAGCTACACAGCAATGTGATGCTAATCTCTAAAAGCCATTCACAGTTCGGATAGGGGTCTGCAACTCGACCCCATGAAGTTGGATTCGCTAGTAATCGCGTATCAGCAATGACGCGGT
And I want to basically add microbial taxonomy to the seq IDs like so:
d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidales_RF16_group; g__Bacteroidales_RF16_group; s__uncultured_bacterium|00009c1cc42953fb4702f6331325c7cc
d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Sphingobacteriales; f__Sphingobacteriaceae; g__Sphingobacterium; s__uncultured_bacterium|000118a5e731455e942c61a82a40367a623088d0
Where the original seqID is appended to the taxonomy with a | as a separator.
Here is my original code, which did not work. Beforehand I made a list of the new seqIDs with the taxonomy added, which I named 'newids_list':
with open('allmergedrep-seqsf.fasta') as original, open('allmergedrep-seqsf2.fasta', 'w') as corrected:
    for seq_record in SeqIO.parse(original, 'fasta'):
        if seq_record.id in newids_list:
            seq_record.id = seq_record.description = newids_list[seq_record.id]
        SeqIO.write(seq_record, corrected, 'fasta')
I made newids_list from a taxonomy file that has the same seqIDs as the fasta file, and it's already in the same order. Any help would be appreciated!
EDIT:
Here is the result of the new fasta file (just showing first two sequences)
>00009c1cc42953fb4702f6331325c7cc
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGTTGT
TAAGTCAGTGGTGAAATCGTGTGGCTCAACCATACGGAGCCATTGAAACTGGCGACCTTG
AGTGTAAACGAGGTAGGCGGAATGTGACGTGTAGCGGTGAAATGCTTAGATATGTCACAG
AACCCCGATTGCGAAGGCAGCTTACCAGCATACAACTGAC
>000118a5e731455e942c61a82a40367a623088d0
AGAGTTTTATCCTGGCTCAGGATGAACGCTAGCGGCAGGCCTAATACATGCAAGTCGGAC
GGGATCTAAATTTAAGCTTGCTTAAGTTTAGTGAGAGTGGCGCACGGGTGCGTAACGCGT
GAGCAACCTACCCATATCAGGGGGATAGCCCGAAGAAATTCGGATTAACACCGCATAACA
CAGCAATCTCGCATGAGATCACTGTTAAATATTTATAGGATATGGATGGGCTCGCGTGAC
ATTAGCTAGTTGGTAAGGTAACGGCTTACCAAGGCAACGATGTCTAGGGGCTCTGAGAGG
AGAATCCCCCACACTGGTACTGAGACACGGACCAGACTCCTACGGGAGGCAGCAGTAAGG
ATTATTGGTCAATGGAGGGAACTCTGAACCAGCCATGCCGCGTGCAGGATGACTGCCCTA
TGGGTTGTAAACTGCTTTTGTCTGGGAATAAACCTTGATTCGTGAATCAAGCTGAATGTA
CCAGAAGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATCCGA
GCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCTTTATAAGTCAGAGGTGA
AAGACGGCAGCTTAACTGTCGCAGTGCCTTTGATACTGTATAGCTTGAATATCGTTGAAG
ATGGCGGAATGAGACAAGTAGCGGTGAAATGCATAGATATGTCTCAGAACTCCGATTGCG
AAGGCAGCTGTCTAAGCGGCAATTGACGCTGATGCACGAAAGCGTGGGGATCAAACAGGA
TTAGATACCCTGGTAGTCCACGCCCTAAACGATGATAACTGGATGTTGGCGATACACAGT
CAGCGTCTTAGCGAAAGCGTTAAGTTATCCACCTGGGGAGTACGCCCGCAAGGGTGAAAC
TCAAAGGAATTGACGGGGGCCCGCACAAGCGGAGGAGCATGTGGTTTAATTCGATGATAC
GCGAGGAACCTTACCCGGGCTTGAAAGTTAGTGAATGCGACAGAGACGTCTCAGTCCTTC
GGGACACGAAACTAGGTGCTGCATGGCTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTA
AGTCCCGCAACGAGCGCAACCCCTATGTTTAGTTGCCAGCATGTAATGATGGGGACTCTA
AACAGACTGCCTGCGTAAGCAGCGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCT
TACGTCCGGGGCTACACACGTGCTACAATGGATGGTACAGCGGGCAGCTACACAGCAATG
TGATGCTAATCTCTAAAAGCCATTCACAGTTCGGATAGGGGTCTGCAACTCGACCCCATG
AAGTTGGATTCGCTAGTAATCGCGTATCAGCAATGACGCGGT
It seems to be identical to the original, just formatted differently, as if the sequences were word-wrapped. But basically the seqID is unchanged.
Also for reference here is my newids_list (first couple new ids):
['d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidales_RF16_group; g__Bacteroidales_RF16_group; s__uncultured_bacterium|00009c1cc42953fb4702f6331325c7cc', 'd__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Sphingobacteriales; f__Sphingobacteriaceae; g__Sphingobacterium; s__uncultured_bacterium|000118a5e731455e942c61a82a40367a623088d0', 'd__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridia_UCG-014; f__Clostridia_UCG-014; g__Clostridia_UCG-014; s__uncultured_bacterium|0001536d70650564fec0c62905eeb73c']
where basically I am trying to add the taxonomy before the seqID, with the two joined by a '|'. Thank you!
Upvotes: 0
Views: 151
Reputation: 608
The main issue with the code is that you treat a list as a dict (your new_list), and that the ID is never actually in new_list (each entry is the taxonomy plus the ID, not the bare ID), so the rename never runs.
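To make both problems concrete, here is a minimal sketch using the first two entries of newids_list from the question. The variable names seq_id, found and error_name are only for illustration; seq_id stands in for what seq_record.id holds after parsing.

```python
# First two entries of newids_list, copied from the question
newids_list = [
    'd__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; '
    'f__Bacteroidales_RF16_group; g__Bacteroidales_RF16_group; '
    's__uncultured_bacterium|00009c1cc42953fb4702f6331325c7cc',
    'd__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Sphingobacteriales; '
    'f__Sphingobacteriaceae; g__Sphingobacterium; '
    's__uncultured_bacterium|000118a5e731455e942c61a82a40367a623088d0',
]

# Stand-in for seq_record.id of the first record in the fasta file
seq_id = '00009c1cc42953fb4702f6331325c7cc'

# "in" on a list compares whole elements, so the bare ID never matches
found = seq_id in newids_list

# Even if the test had passed, a list cannot be indexed with a string
try:
    newids_list[seq_id]
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

print(found, error_name)  # False TypeError
```

Because the membership test is False for every record, the assignment is skipped and each record is written back with its original ID, which matches the output shown in the question.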
Here is an example of how I would do the rename, to get you started:
from Bio import SeqIO

# define new_list as a dict with the sequence IDs as keys and the taxonomy as values
new_list = {seq_id: tax for seq_id, tax in zip(LIST_OF_SEQ_IDS, LIST_OF_TAX)}  # you need to provide these two lists somehow
original = list(SeqIO.parse('allmergedrep-seqsf.fasta', 'fasta'))
corrected = []
for s in original:
    # here we build the requested ID format
    # note that FASTA IDs usually do not contain spaces
    s.id = '{}|{}'.format(new_list[s.id], s.id)
    # BioPython sometimes adds the ID here as well (and in some cases also to "s.name")
    s.description = ''
    corrected.append(s)
SeqIO.write(corrected, 'allmergedrep-seqsf2.fasta', 'fasta')
If your new_list is truly in the same order as the fasta file and already contains the full new IDs you want, why don't you just:
with open('allmergedrep-seqsf.fasta') as original, open('allmergedrep-seqsf2.fasta', 'w') as corrected:
    for seq_record, new_name in zip(SeqIO.parse(original, 'fasta'), new_list):
        seq_record.id = new_name
        seq_record.description = ''  # do you need that taxonomy twice?
        SeqIO.write(seq_record, corrected, 'fasta')
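If you would rather not rely on the two files being in the same order, one option is to turn the existing list into a dict keyed by the original ID. This is only a sketch: build_id_map and rename are hypothetical helper names, and it assumes every entry has the form taxonomy|original_id with the ID after the last '|', as in the list shown in the question.

```python
def build_id_map(new_ids):
    """Map each original ID to its full 'taxonomy|original_id' string."""
    id_map = {}
    for entry in new_ids:
        # Split on the LAST '|' only, so a '|' inside the taxonomy is harmless
        _, original_id = entry.rsplit('|', 1)
        id_map[original_id] = entry
    return id_map


def rename(original_id, id_map):
    """Return the new ID, or the unchanged ID if no taxonomy is known for it."""
    return id_map.get(original_id, original_id)


# One entry copied from the question's newids_list
newids_list = [
    'd__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridia_UCG-014; '
    'f__Clostridia_UCG-014; g__Clostridia_UCG-014; '
    's__uncultured_bacterium|0001536d70650564fec0c62905eeb73c',
]
id_map = build_id_map(newids_list)

renamed = rename('0001536d70650564fec0c62905eeb73c', id_map)
untouched = rename('id_not_in_the_list', id_map)
```

Inside the SeqIO loop you would then set seq_record.id = rename(seq_record.id, id_map) and clear seq_record.description as before. Records without a matching entry keep their original ID instead of raising a KeyError.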
Upvotes: 1