Pull dna sequence by feature from genbank file

Question

I have genbank files consisting up multiple annotated contigs. What I'd like to do is separate that out into a database that has individual gene records containing each 'CDS' feature, along with its dna and amino acid sequence. So far this works just fine:

for record in SeqIO.parse(open_file, 'gb'):
   for feature in record.features:
        if feature.type == 'CDS':
            gene_record = {
                'locus_tag':feature.qualifiers['locus_tag'][0],
                'translation':feature.qualifiers['translation'][0],
                }

The trouble I'm having is getting the dna sequence. The genbank file is formatted thus:

FEATURES             Location/Qualifiers
     source          1..29869
                     /organism="Arthrobacter"
                     /mol_type="genomic DNA"
                     /strain="strain_name"
     gene            complement(4..462)
                     /locus_tag="ArthroDRAFT_00001"
     CDS             complement(4..462)
                     /locus_tag="ArthroDRAFT_00001"
                     /product="hypothetical protein"
                     /translation="LSTGKELLNYQSALNDIHDEFSRAQQSDAGVSHLSVAKITEKLS
                     YLKATALQMDDLFSVLRKQGVSLRSTGLADWASVPTIQDDKEEGKTEPSLAKKEISSR
                     TTSKPNKIDFPKFEYPDHGQPTNKIRVGTILDTFSESAFSYEWINVALQD"
     gene            complement(1126..1842)
                     /locus_tag="ArthroDRAFT_00002"
     CDS             complement(1126..1842)
                     /locus_tag="ArthroDRAFT_00002"
                     /product="hypothetical protein"
                     /translation="VPRAFIYGSCVGGDTANVFPSDWDRPTYVARQSIISAAFGPTSV
                     EGDIELTSAFQRSMLEGDIEATAFPRLRQELPTHDVLILDIVDERLGVYELAPGKYLT
                     RSMELISSKLIGKQPVTPRLIEFGSDEHYGLWTRSVDMLVDVVKHGGIPVFALLPPWS
                     EKSIQGEDLTWHSVSVDLMNNKYARYNEYLVQSEFTVVTVPDEEALGDAEHKWGLAPF
                     HYTESVYESLRDQILVGVSS"
...etc

ORIGIN
        1 ccctcaatcc tgaagagcca cattgatcca ttcgtatgag aatgcagatt ccgagaatgt
       61 atcaagaatt gttcctacac gtattttgtt tgtcggctgg ccgtggtccg gatactcaaa
      121 ttttggaaag tcaatcttgt ...

So, I have the location for each feature, but I need to somehow parse the dna sequence to pull the right dna sequence, and I'm sort of at a loss for how to do this. feature.location gives me what seems like a useful output:

[3:462](-)
[1125:1842](-)
[2159:3755](-)
[5190:5532](-)
[6226:6493](+)

But the only way I know to use this would be to parse that with regular expressions, then do record.seq[start_num:finish_num]. And then if it's on the - strand, run whatever that is through reverse_complement.

This seems way too convoluted, and I thought biopython must have a more efficient way, but I can't seem to find it. I noticed that the feature.location already seems to have 1 subtracted from it to account for python's 0 index, so that number must be useable... right?

EDIT: the SeqFeature.location seems like it's what I want to use, but can't figure out how: http://biopython.org/DIST/docs/api/Bio.SeqFeature.FeatureLocation-class.html

EDIT2: Looks like I can do

record.seq[feature.location.start:feature.location.end]

This does not yet work for negative strand stuff though...

user5487747 · Accepted Answer

I use an "extract" method of location object.

dna_seq = feature.location.extract(record)

This works fine.

Pull dna sequence by feature from genbank file

Answers (2)

Related Questions