Reputation: 101
I was given a FASTA formatted file (like from this site: http://www.uniprot.org/proteomes/) that gives various protein coding sequences within a certain bacteria. I have been asked to give a full count and the relative percentage of each of the single code amino acids contained within the file, and return the results like:
L: 139002 (10.7%)
A: 123885 (9.6%)
G: 95475 (7.4%)
V: 91683 (7.1%)
I: 77836 (6.0%)
What I have so far:
#!/usr/bin/python
ecoli = open("/home/file_pathway").read()
counts = dict()
for line in ecoli:
words = line.split()
for word in words:
if word in ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"]:
if word not in counts:
counts[word] = 1
else:
counts[word] += 1
for key in counts:
print key, counts[key]
I believe that doing this is retrieving all of the instances of the capital letters and not just those contained within the protein amino acid string, how can I limit it just to the coding sequence? I am also having trouble writing how to calculate the each single code over the total
Upvotes: 1
Views: 1789
Reputation: 180411
The only lines that don't contain what you want start with >
just ignore those:
with open("input.fasta") as ecoli: # will close your file automatically
from collections import defaultdict
counts = defaultdict(int)
for line in ecoli: # iterate over file object, no need to read all contents into memory
if line.startswith(">"): # skip lines that start with >
continue
for char in line: # just iterate over the characters in the line
if char in {"A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"}:
counts[char] += 1
total = float(sum(counts.values()))
for key,val in counts.items():
print("{}: {}, ({:.1%})".format(key,val, val / total))
You could also use a collections.Counter dict as the lines only contain what you are interested in:
with open("input.fasta") as ecoli: # will close your file automatically
from collections import Counter
counts = Counter()
for line in ecoli: # iterate over file object, no need to read all contents onto memory
if line.startswith(">"): # skip lines that start with >
continue
counts.update(line.rstrip())
total = float(sum(counts.values()))
for key,val in counts.items():
print("{}: {}, ({:.1%})".format(key,val, val / total))
Upvotes: 3
Reputation: 485
Using Counter makes it a bit easier and avoids the dictionary (I like dicts, but in this case, Counter
really makes sense).
from collections import Counter
acids = "" # dunno if this is the right terminology
with open(filename, 'r') as ecoli_file:
for line in ecoli_file:
if line.startswith('>'):
continue
# from what I saw in the FASTA files, the character-check is
# not necessary anymore...
acids += line.strip() # stripping newline and possible whitespaces
counter = Counter(acids) # and all the magic is done.
total = float(sum(counter.values()))
for k, v in counter.items():
print "{}: {} ({:.1%})".format(k, v, v / total)
As Counter
accepts iterables, it should be possible to do it with a generator:
from collections import Counter
with open(filename) as f:
counter = Counter(c for line in f if not line.startswith('>')
for c in line.strip())
# and now as above
total = float(sum(counter.values()))
for k, v in counter.items():
print "{}: {} ({:.1%})".format(k, v, v/total)
Upvotes: -1
Reputation: 85
You are correct that the way you are approaching this, you will count the instances of the character wherever they are, even in the description lines.
But your code will not even run, have you tried yet? You have line.split() but line is undefined (and numerous other errors). Additionally, by the time you are at "for string in ecoli:" you are already going character by character.
A simple way to approach this, is to read in the file, split on the newline character, skip lines beginning with ">", tally up the number of each character you care about and keep a running total of all characters analyzed.
#!/usr/bin/python
ecoli = open("/home/file_pathway.faa").read()
counts = dict()
nucleicAcids = ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"]
for acid in nucleicAcids:
counts[acid] = 0
total = 0
for line in ecoli.split('\n'):
if ">" not in line:
total += len(line)
for acid in counts.keys():
counts[acid] += line.count(acid)
Upvotes: 0