Carey
Carey

Reputation: 13

Python GC Counter - Rosalind

I'm attempting to write a programme that will calculate the GC content in each of a series of sequences (input in fasta format) and then return the name of the sequence with the highest percentage and its GC percentage. As per this Rosalind problem.

I've finally stopped getting error messages but my code doesn't appear to do anything. Does anyone have any idea why that might be?

#Define functions
#Calculate GC percentage 
def Percent(sequence):
G_count = sequence.count ('G')
C_count = sequence.count ('C')
Total_count = len(sequence)
GC_Sum = int(G_count) + int(C_count)
Percent_GC = GC_Sum / Total_count
Per_GC = (Percent_GC)*100
return Per_GC

Input = input ("Input Sequence")

#Fasta file into dictionary
fasta_dictionary = {}
sequence_name = ""
for line in Input:
    line = line.strip()
    if not line:
        continue
    if line.startswith(">"):
        sequence_name = line[1:]
        if sequence_name not in fasta_dictionary:
            fasta_dictionary[sequence_name] = []
        continue
    sequence = line
    fasta_dictionary[sequence_name].append(sequence)

#Put GC values for each sequence into dictionary
dictionary = dict()
for sequence_name in fasta_dictionary:
dictionary[sequence_name] = float(Percent(sequence))

#Find highest
for sequence_name, sequence in fasta_dictionary.items():
    inverse = [(sequence, sequence_name) for sequence_name, sequence in dictionary.items()]
    highest_GC = max(inverse)[1]  

#Find sequence name
for sequence_name, sequence in fasta_dictionary.items():
        if sequence == highest_GC:
            print ((sequence_name) + ' ' + (highest_GC))

Upvotes: 3

Views: 3199

Answers (2)

Seekheart
Seekheart

Reputation: 1173

Another solution for your GC problem would be to use the Counter higher-order data structure in python. It can automatically set and count your nucleotides for you such that you can just directly ask for the numbers to calculate as follows:

from collections import Counter

#set a var to hold your dna
myDna = ''
#open your Dna fasta
with open('myFasta', 'r') as data:
     for line in data:
          if '>' in line:
               continue
          myDna += line.strip()

#Now count your dna
myNucleotideCounts = Counter(myDna)

#calculate GC content
myGC = (myNucleotideCounts['G'] + myNucleotideCounts['C']) / float(len(myDna))

print('Dna GC Content = {0}'.format(myGC))

Upvotes: 0

Kevin
Kevin

Reputation: 8207

So, Pier Paolo is correct change the first line to with open() and indent the rest of your code under that as shown below.

with open('/path/to/your/fasta.fasta', 'r') as Input:
   fasta_dictionary = {}

He is also correct on the division -- this should help inside your Percent function. Percent_GC = float(GC_Sum) / Total_count

Instead of appending, just assign sequence as a string.

sequence = line
fasta_dictionary[sequence_name] = sequence

Next, when you are calling your Percent function, you are passing sequence, after you've exited the for loop where you define each sequence iteratively. You have them stored in a dictionary called fasta_dictionary, so change this piece of code.

for sequence_name in fasta_dictionary:
        dictionary[sequence_name] = float(Percent(fasta_dictionary[sequence_name]))

Finally, at the end, you are checking if sequence == highest_GC:. This is what you're currently checking:

for sequence_name, sequence in fasta_dictionary.items():
            print sequence

prints a str of the actual sequence data.

'ATTGCGCTANANAGCTANANCGATAGANCACGATNGAGATAGACTATAGC'

and highest_GC is the "name" of the sequence

>sequence1

change that to read if sequence_name == highest_GC

Running your code with the changes above always printed the name of the sequence with the highest GC content %. There are a lot of other unnecessary steps and repetitive code, but hopefully this gets you started. Good Luck!

Upvotes: 1

Related Questions