movietime
movietime

Reputation: 101

Python Dict and Forloop with FASTA file

I was given a FASTA formatted file (like from this site: http://www.uniprot.org/proteomes/) that gives various protein coding sequences within a certain bacteria. I have been asked to give a full count and the relative percentage of each of the single code amino acids contained within the file, and return the results like:

L: 139002 (10.7%) 

A: 123885 (9.6%) 

G: 95475 (7.4%) 

V: 91683 (7.1%) 

I: 77836 (6.0%)

What I have so far:

 #!/usr/bin/python
ecoli = open("/home/file_pathway").read()
counts = dict()
for line in ecoli:
    words = line.split()
    for word in words:
        if word in ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"]:
            if word not in counts:
                counts[word] = 1
            else:
                counts[word] += 1

for key in counts:
    print key, counts[key]

I believe that doing this is retrieving all of the instances of the capital letters and not just those contained within the protein amino acid string, how can I limit it just to the coding sequence? I am also having trouble writing how to calculate the each single code over the total

Upvotes: 1

Views: 1789

Answers (3)

Padraic Cunningham
Padraic Cunningham

Reputation: 180411

The only lines that don't contain what you want start with > just ignore those:

with open("input.fasta") as ecoli: # will close your file automatically
    from collections import defaultdict
    counts = defaultdict(int) 
    for line in ecoli: # iterate over file object, no need to read all contents into memory
        if line.startswith(">"): # skip lines that start with >
            continue
        for char in line: # just iterate over the characters in the line
            if char in {"A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"}:
                    counts[char] += 1
    total = float(sum(counts.values()))       
    for key,val in counts.items():
        print("{}: {}, ({:.1%})".format(key,val, val / total))

You could also use a collections.Counter dict as the lines only contain what you are interested in:

with open("input.fasta") as ecoli: # will close your file automatically
    from collections import Counter
    counts = Counter()
    for line in ecoli: # iterate over file object, no need to read all contents onto memory
        if line.startswith(">"): # skip lines that start with >
            continue
        counts.update(line.rstrip())
    total = float(sum(counts.values()))
    for key,val in counts.items():
        print("{}: {}, ({:.1%})".format(key,val, val / total))

Upvotes: 3

Dave J
Dave J

Reputation: 485

Using Counter makes it a bit easier and avoids the dictionary (I like dicts, but in this case, Counter really makes sense).

from collections import Counter
acids = ""                      # dunno if this is the right terminology
with open(filename, 'r') as ecoli_file:
    for line in ecoli_file:
        if line.startswith('>'):
            continue
        # from what I saw in the FASTA files, the character-check is
        # not necessary anymore...
        acids += line.strip()   # stripping newline and possible whitespaces
 counter = Counter(acids)       # and all the magic is done.
 total = float(sum(counter.values()))
 for k, v in counter.items():
     print "{}: {} ({:.1%})".format(k, v, v / total)

As Counter accepts iterables, it should be possible to do it with a generator:

from collections import Counter
with open(filename) as f:
    counter = Counter(c for line in f if not line.startswith('>')
                      for c in line.strip())
# and now as above
total = float(sum(counter.values()))
for k, v in counter.items():
    print "{}: {} ({:.1%})".format(k, v, v/total)

Upvotes: -1

user982835
user982835

Reputation: 85

You are correct that the way you are approaching this, you will count the instances of the character wherever they are, even in the description lines.

But your code will not even run, have you tried yet? You have line.split() but line is undefined (and numerous other errors). Additionally, by the time you are at "for string in ecoli:" you are already going character by character.

A simple way to approach this, is to read in the file, split on the newline character, skip lines beginning with ">", tally up the number of each character you care about and keep a running total of all characters analyzed.

#!/usr/bin/python
ecoli = open("/home/file_pathway.faa").read()
counts = dict()
nucleicAcids = ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"]
for acid in nucleicAcids:
    counts[acid] = 0
total = 0

for line in ecoli.split('\n'):
    if ">" not in line:
        total += len(line)
        for acid in counts.keys():
            counts[acid] += line.count(acid)

Upvotes: 0

Related Questions