Reputation: 3432
I have a dictionary of Biopython SeqRecord objects such as:
mydict = {'scaffold1': SeqRecord(seq=Seq('AGAGGTAGAGGCAGAAAACATAGTGAGCACGCTGTGTTTAAT'), id='scaffold1', name='scaffold1', description='scaffold1 0.0', dbxrefs=[]), 'scaffold2': SeqRecord(seq=Seq('GCAAAAGCAAAGCCAGATCAGAGTCCAGACAGTGAAGGCAAGACTAGTAAAGT'), id='scaffold2', name='scaffold2', description='scaffold2 0.0', dbxrefs=[])}
I wondered if someone knew an efficient way to process this dictionary and create a dataframe from it with three columns: Scaffolds (the dictionary key), Seq_length (the length of Seq), and GC% (the number of G and C letters within Seq divided by Seq_length). For example, len(Seq) of scaffold1 is 42 and it contains 18 G and C letters, so GC% = 18/42.
I should then get:
Scaffolds  Seq_length  GC%
scaffold1  42          0.428
scaffold2  53          0.453
I'm looking for an efficient way to do this, as my real dict is huge (1,046,544 keys).
Thanks a lot for your help
Upvotes: 1
Views: 100
Reputation: 262429
You can build one row per dictionary entry and pass the list to the DataFrame constructor. Note that GC returns a percentage (0 to 100), so divide by 100 if you want a fraction:
import pandas as pd
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqUtils import GC
mydict = {'scaffold1': SeqRecord(seq=Seq('AGAGGTAGAGGCAGAAAACATAGTGAGCACGCTGTGTTTAAT'), id='scaffold1', name='scaffold1', description='scaffold1 0.0', dbxrefs=[]), 'scaffold2': SeqRecord(seq=Seq('GCAAAAGCAAAGCCAGATCAGAGTCCAGACAGTGAAGGCAAGACTAGTAAAGT'), id='scaffold2', name='scaffold2', description='scaffold2 0.0', dbxrefs=[])}
# One dict per scaffold: key, sequence length, GC content in percent
df = pd.DataFrame([{'Scaffolds': k,
                    'Seq_length': len(s.seq),
                    'GC%': GC(s.seq)}
                   for k, s in mydict.items()])
Output:
   Scaffolds  Seq_length        GC%
0  scaffold1          42  42.857143
1  scaffold2          53  45.283019
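If you want the fraction from the question (18/42) rather than a percentage, here is a minimal sketch of the same calculation in plain Python. It assumes the sequences are available as plain strings (for a SeqRecord, str(record.seq) should give that); gc_rows and sequences are hypothetical names used only for illustration, not part of Biopython or pandas.

```python
def gc_rows(sequences):
    """Return (name, length, gc_fraction) tuples for a {name: sequence} mapping."""
    rows = []
    for name, seq in sequences.items():
        seq = str(seq).upper()  # also accepts objects that convert to str
        length = len(seq)
        gc_count = seq.count("G") + seq.count("C")
        # Guard against empty sequences to avoid a ZeroDivisionError
        rows.append((name, length, gc_count / length if length else 0.0))
    return rows


# Plain-string stand-ins for the two records in the question (assumption)
sequences = {
    "scaffold1": "AGAGGTAGAGGCAGAAAACATAGTGAGCACGCTGTGTTTAAT",
    "scaffold2": "GCAAAAGCAAAGCCAGATCAGAGTCCAGACAGTGAAGGCAAGACTAGTAAAGT",
}

rows = gc_rows(sequences)
for name, length, gc in rows:
    print(name, length, round(gc, 3))
# scaffold1 42 0.429
# scaffold2 53 0.453
```

The resulting list of tuples can be passed to pd.DataFrame(rows, columns=['Scaffolds', 'Seq_length', 'GC%']). Unlike Biopython's helper, this sketch counts only the letters G and C and ignores ambiguity codes. If your Biopython version no longer provides GC, Bio.SeqUtils.gc_fraction should return the value as a fraction directly.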
Upvotes: 2