Reputation: 9
given a FASTA text file (Rosalind_gc.txt), I am supposed to go through each DNA record and identify the percentage (%) of Guanine-Cytosine (GC) content.
Example of this is :
Sample Dataset:
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
Sample output:
Rosalind_0808 60.919540
So basically go through each string, count the amt of times G/C show up and then divide that total by the length of each string. My issue is learning how to identify the breaks in code (i.e. >Rosalind_6404 ). I would like an example of this code without using Biopython and also with the biopython approach.
Upvotes: 0
Views: 196
Reputation: 2882
You may want to make use of precompiled C modules as much as possible for performance issue. There's one solution using regex:
seq = 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG'
import re
perc = re.subn(r'[GC]', '', seq) / len(seq)
And also handle the ">" lines:
seq = []
name = ''
for line in open('Rosalind_gc.txt'):
if not line.startswith('>'):
seq.append(line.strip())
else:
if seq:
seq = ''.join(seq)
perc = re.subn(r'[GC]', '', seq) / len(seq)
print('{} has GC percent: {}'.format(name, perc * 100))
seq = []
name = line.strip()
Upvotes: 1
Reputation: 42143
You could read the file line by line and accumulate sequence data up to the next line that starts with ">" (plus one more time for the end of the file)
def getCount(seq):
return seq.count("G")+seq.count("C")
with open("input.txt","r") as file:
sequence = ""
name = ""
for line in file:
line = line.strip()
if not line.startswith(">"):
sequence += line
continue
if name != "":
print(name, 100*getCount(sequence)/len(sequence))
name = line[1:]
sequence = ""
print(name, 100*getCount(sequence)/len(sequence))
# Rosalind_6404 53.75
# Rosalind_5959 53.57142857142857
# Rosalind_0808 60.91954022988506
Upvotes: 2
Reputation: 41168
Since you're looking for a Biopython solution, here is a very simple one:
from Bio import SeqIO
from Bio.SeqUtils import GC
for r in SeqIO.parse('Rosalind_gc.fa', 'fasta'):
print(r.id, GC(r.seq))
Outputs:
Rosalind_6404 53.75
Rosalind_5959 53.57142857142857
Rosalind_0808 60.91954022988506
Upvotes: 2