Reputation: 13
I'm attempting to write a programme that will calculate the GC content in each of a series of sequences (input in fasta format) and then return the name of the sequence with the highest percentage and its GC percentage. As per this Rosalind problem.
I've finally stopped getting error messages but my code doesn't appear to do anything. Does anyone have any idea why that might be?
#Define functions
#Calculate GC percentage
def Percent(sequence):
G_count = sequence.count ('G')
C_count = sequence.count ('C')
Total_count = len(sequence)
GC_Sum = int(G_count) + int(C_count)
Percent_GC = GC_Sum / Total_count
Per_GC = (Percent_GC)*100
return Per_GC
Input = input ("Input Sequence")
#Fasta file into dictionary
fasta_dictionary = {}
sequence_name = ""
for line in Input:
line = line.strip()
if not line:
continue
if line.startswith(">"):
sequence_name = line[1:]
if sequence_name not in fasta_dictionary:
fasta_dictionary[sequence_name] = []
continue
sequence = line
fasta_dictionary[sequence_name].append(sequence)
#Put GC values for each sequence into dictionary
dictionary = dict()
for sequence_name in fasta_dictionary:
dictionary[sequence_name] = float(Percent(sequence))
#Find highest
for sequence_name, sequence in fasta_dictionary.items():
inverse = [(sequence, sequence_name) for sequence_name, sequence in dictionary.items()]
highest_GC = max(inverse)[1]
#Find sequence name
for sequence_name, sequence in fasta_dictionary.items():
if sequence == highest_GC:
print ((sequence_name) + ' ' + (highest_GC))
Upvotes: 3
Views: 3199
Reputation: 1173
Another solution for your GC problem would be to use the Counter higher-order data structure in python. It can automatically set and count your nucleotides for you such that you can just directly ask for the numbers to calculate as follows:
from collections import Counter
#set a var to hold your dna
myDna = ''
#open your Dna fasta
with open('myFasta', 'r') as data:
for line in data:
if '>' in line:
continue
myDna += line.strip()
#Now count your dna
myNucleotideCounts = Counter(myDna)
#calculate GC content
myGC = (myNucleotideCounts['G'] + myNucleotideCounts['C']) / float(len(myDna))
print('Dna GC Content = {0}'.format(myGC))
Upvotes: 0
Reputation: 8207
So, Pier Paolo is correct change the first line to with open()
and indent the rest of your code under that as shown below.
with open('/path/to/your/fasta.fasta', 'r') as Input:
fasta_dictionary = {}
He is also correct on the division -- this should help inside your Percent
function. Percent_GC = float(GC_Sum) / Total_count
Instead of appending, just assign sequence
as a string.
sequence = line
fasta_dictionary[sequence_name] = sequence
Next, when you are calling your Percent
function, you are passing sequence
, after you've exited the for loop where you define each sequence
iteratively. You have them stored in a dictionary called fasta_dictionary
, so change this piece of code.
for sequence_name in fasta_dictionary:
dictionary[sequence_name] = float(Percent(fasta_dictionary[sequence_name]))
Finally, at the end, you are checking if sequence == highest_GC:
. This is what you're currently checking:
for sequence_name, sequence in fasta_dictionary.items():
print sequence
prints a str
of the actual sequence data.
'ATTGCGCTANANAGCTANANCGATAGANCACGATNGAGATAGACTATAGC'
and highest_GC
is the "name" of the sequence
>sequence1
change that to read if sequence_name == highest_GC
Running your code with the changes above always printed the name of the sequence with the highest GC content %. There are a lot of other unnecessary steps and repetitive code, but hopefully this gets you started. Good Luck!
Upvotes: 1