Thomas William Dunn
Thomas William Dunn

Reputation: 9

How to write a string algorithm

given a FASTA text file (Rosalind_gc.txt), I am supposed to go through each DNA record and identify the percentage (%) of Guanine-Cytosine (GC) content.

Example of this is :

Sample Dataset:

>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG    
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample output:

Rosalind_0808 60.919540

So basically go through each string, count the amt of times G/C show up and then divide that total by the length of each string. My issue is learning how to identify the breaks in code (i.e. >Rosalind_6404 ). I would like an example of this code without using Biopython and also with the biopython approach.

Upvotes: 0

Views: 196

Answers (3)

knh190
knh190

Reputation: 2882

You may want to make use of precompiled C modules as much as possible for performance issue. There's one solution using regex:

seq = 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG'

import re
perc = re.subn(r'[GC]', '', seq) / len(seq)

And also handle the ">" lines:

seq = []
name = ''

for line in open('Rosalind_gc.txt'):
    if not line.startswith('>'):
        seq.append(line.strip())
    else:
        if seq:
            seq = ''.join(seq)
            perc = re.subn(r'[GC]', '', seq) / len(seq)
            print('{} has GC percent: {}'.format(name, perc * 100))
            seq = []
        name = line.strip()

Upvotes: 1

Alain T.
Alain T.

Reputation: 42143

You could read the file line by line and accumulate sequence data up to the next line that starts with ">" (plus one more time for the end of the file)

def getCount(seq):
    return seq.count("G")+seq.count("C") 

with open("input.txt","r") as file:
    sequence = ""
    name     = ""
    for line in file:
        line = line.strip()
        if not line.startswith(">"):
            sequence += line
            continue
        if name != "":
            print(name, 100*getCount(sequence)/len(sequence))
        name     = line[1:]
        sequence = ""
    print(name, 100*getCount(sequence)/len(sequence))

# Rosalind_6404 53.75
# Rosalind_5959 53.57142857142857
# Rosalind_0808 60.91954022988506

Upvotes: 2

Chris_Rands
Chris_Rands

Reputation: 41168

Since you're looking for a Biopython solution, here is a very simple one:

from Bio import SeqIO
from Bio.SeqUtils import GC

for r in SeqIO.parse('Rosalind_gc.fa', 'fasta'):
    print(r.id, GC(r.seq))

Outputs:

Rosalind_6404 53.75
Rosalind_5959 53.57142857142857
Rosalind_0808 60.91954022988506

Upvotes: 2

Related Questions