Reputation: 1040
I have genbank files consisting up multiple annotated contigs. What I'd like to do is separate that out into a database that has individual gene records containing each 'CDS' feature, along with its dna and amino acid sequence. So far this works just fine:
for record in SeqIO.parse(open_file, 'gb'):
for feature in record.features:
if feature.type == 'CDS':
gene_record = {
'locus_tag':feature.qualifiers['locus_tag'][0],
'translation':feature.qualifiers['translation'][0],
}
The trouble I'm having is getting the dna sequence. The genbank file is formatted thus:
FEATURES Location/Qualifiers
source 1..29869
/organism="Arthrobacter"
/mol_type="genomic DNA"
/strain="strain_name"
gene complement(4..462)
/locus_tag="ArthroDRAFT_00001"
CDS complement(4..462)
/locus_tag="ArthroDRAFT_00001"
/product="hypothetical protein"
/translation="LSTGKELLNYQSALNDIHDEFSRAQQSDAGVSHLSVAKITEKLS
YLKATALQMDDLFSVLRKQGVSLRSTGLADWASVPTIQDDKEEGKTEPSLAKKEISSR
TTSKPNKIDFPKFEYPDHGQPTNKIRVGTILDTFSESAFSYEWINVALQD"
gene complement(1126..1842)
/locus_tag="ArthroDRAFT_00002"
CDS complement(1126..1842)
/locus_tag="ArthroDRAFT_00002"
/product="hypothetical protein"
/translation="VPRAFIYGSCVGGDTANVFPSDWDRPTYVARQSIISAAFGPTSV
EGDIELTSAFQRSMLEGDIEATAFPRLRQELPTHDVLILDIVDERLGVYELAPGKYLT
RSMELISSKLIGKQPVTPRLIEFGSDEHYGLWTRSVDMLVDVVKHGGIPVFALLPPWS
EKSIQGEDLTWHSVSVDLMNNKYARYNEYLVQSEFTVVTVPDEEALGDAEHKWGLAPF
HYTESVYESLRDQILVGVSS"
...etc
ORIGIN
1 ccctcaatcc tgaagagcca cattgatcca ttcgtatgag aatgcagatt ccgagaatgt
61 atcaagaatt gttcctacac gtattttgtt tgtcggctgg ccgtggtccg gatactcaaa
121 ttttggaaag tcaatcttgt ...
So, I have the location for each feature, but I need to somehow parse the dna sequence to pull the right dna sequence, and I'm sort of at a loss for how to do this. feature.location
gives me what seems like a useful output:
[3:462](-)
[1125:1842](-)
[2159:3755](-)
[5190:5532](-)
[6226:6493](+)
But the only way I know to use this would be to parse that with regular expressions, then do record.seq[start_num:finish_num]
. And then if it's on the - strand, run whatever that is through reverse_complement
.
This seems way too convoluted, and I thought biopython must have a more efficient way, but I can't seem to find it. I noticed that the feature.location
already seems to have 1 subtracted from it to account for python's 0 index, so that number must be useable... right?
EDIT: the SeqFeature.location
seems like it's what I want to use, but can't figure out how: http://biopython.org/DIST/docs/api/Bio.SeqFeature.FeatureLocation-class.html
EDIT2: Looks like I can do
record.seq[feature.location.start:feature.location.end]
This does not yet work for negative strand stuff though...
Upvotes: 1
Views: 665
Reputation: 26
I use an "extract" method of location object.
dna_seq = feature.location.extract(record)
This works fine.
Upvotes: 1
Reputation: 1040
Ok, never mind - more googling and experimentation got me to a better solution. Not sure if it's the best solution, I'll leave the question up and unanswered in case anyone has better ideas. What I did to get the DNA sequence is the following:
if feature.location.strand == 1:
dna_seq = record.seq[
feature.location.start:feature.location.end
]
elif feature.location.strand == -1:
dna_seq = record.seq[
feature.location.start:feature.location.end
].reverse_complement()
Not the prettiest solution, but it seems to work...
Upvotes: 1