movietime

Reputation: 101

Python Character Count in File

I was given a text file containing the coding sequences for various proteins in a certain bacterium. Each entry consists of a short description followed by the amino acid sequence, written as capital letters. I have been asked to count each single-letter amino acid code and report the result in the form:

A: 1567
C: 8776
D: 6643
E: 3345
etc.

What I have so far:
I know it involves using dicts and for loops, so I have written:

#!/usr/bin/python
ecoli = open("/file_pathway.txt").read()
counts = dict()
for line in ecoli:
    words = line.split()
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1

for key in counts:
    print key, counts[key]

I am just not sure how to edit the if statement so that it only counts the particular uppercase letters I am searching for (i.e. A, C, D, E, L...).

Upvotes: 0

Views: 385

Answers (5)

dawg

Reputation: 103694

You could use a Counter

from collections import Counter

lets=Counter()
with open(ur_file, 'r') as f:
    for line in f:
        for c in line.strip():
            lets[c]+=1
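The loop above counts every character it sees, including lowercase letters, digits, and punctuation from the description lines. Here is a minimal sketch of restricting it to the one-letter codes and printing in the format the question asks for. It assumes Python 3. The names `AMINO_ACIDS` and `count_letters` are made up for illustration, and the set holds the 20 standard amino acid letters, which may not match the letters your assignment expects.

```python
from collections import Counter

# Assumption: the 20 standard one-letter amino acid codes.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def count_letters(lines, wanted=AMINO_ACIDS):
    """Count only the characters in `wanted` across an iterable of lines."""
    lets = Counter()
    for line in lines:
        # Counter.update accepts any iterable and adds one per item.
        lets.update(c for c in line if c in wanted)
    return lets

if __name__ == "__main__":
    sample = ["MKV LAA\n", "acde ACDE\n"]  # stand-in for an open file object
    for letter, n in sorted(count_letters(sample).items()):
        print("{}: {}".format(letter, n))
```

Passing an open file object instead of `sample` should work the same way, since iterating over a file yields its lines.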

Upvotes: 0

Vietnhi Phuvan

Reputation: 2804

In [1]: !cat test.dat AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDDDDDDDDDDDDDDDDDDDEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

In [2]: inf = open('test.dat','r') #Create the python file object inf

In [3]: s = inf.readline() #read the first line of the file (here, all of its data) into the string variable s

In [4]: [s.count(i) for i in 'ACDE'] #apply list comprehension to get the letter count
Out[4]: [156, 29, 20, 37]

In [5]: inf.close()

In [6]:

I am assuming that your amino acid sequence is written in the file test.dat as a single string (no quotes) and that the file contains nothing except that string. Result: the 'A' count is 156, the 'C' count is 29, etc. Note: the fact that test.dat shows the letters in sorted order is purely coincidental and irrelevant. The sequence could have been 'AEDC...' and the result would have been the same.
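The list in Out[4] does not record which count belongs to which letter, and `readline()` only sees the first line of the file. A hedged sketch that pairs each letter with its count and works on text spanning several lines is shown here. `count_each` is a hypothetical helper name and Python 3 is assumed.

```python
def count_each(text, letters="ACDE"):
    """Return {letter: occurrences} using str.count, one scan per letter."""
    return {letter: text.count(letter) for letter in letters}

# `text` would normally come from open(path).read(); a literal is used here.
text = "AAAC\nCDDE\n"
print(count_each(text))
```

Each `str.count` call scans the whole string, so 20 letters mean 20 passes. That is usually fine for a small file, but a single-pass counter may be preferable for a large one.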

Upvotes: 0

J. Katzwinkel

Reputation: 1953

I like to skip the extra test for whether each word is already a key in the dict by supplying a default value of 0 at lookup:

ecoli = open("/file_pathway.txt").read()
counts = dict()
for line in ecoli:
    for word in [w for w in line.split() if w in 'ACDEL']:
        counts[word] = counts.get(word,0) + 1
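Because `read()` returns one string, `for line in ecoli` walks the file character by character rather than line by line, which is why testing single letters works at all. A sketch that makes the per-character loop explicit while keeping the `dict.get` default is given here. `count_with_get` is an illustrative name and Python 3 is assumed.

```python
def count_with_get(lines, wanted="ACDEL"):
    """Count the characters in `wanted`, using dict.get to default to 0."""
    counts = {}
    for line in lines:
        for ch in line:  # iterate characters explicitly
            if ch in wanted:
                counts[ch] = counts.get(ch, 0) + 1
    return counts

print(count_with_get(["AC xL\n", "AAl\n"]))
```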

Upvotes: 0

sedavidw

Reputation: 11691

A couple of things I suggest here. First, you can use collections.defaultdict to make a dictionary that you can start adding to right away, without initializing each key:

from collections import defaultdict
counts = defaultdict(int)

Then you can just use

counts[word] += 1 #don't need to check if word already exists

Second, if you know which words you are looking for, keep them in a list

search_words = ['A', 'C' ...]

Then you can check if the word you care about is in there

if word in search_words:
    counts[word] += 1
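Put together, the pieces above might look like the following sketch. The letters in `search_words` are only the ones named in the question (A, C, D, E, L), not a full list. `count_search_letters` is a hypothetical name and Python 3 is assumed.

```python
from collections import defaultdict

def count_search_letters(lines, search_words=("A", "C", "D", "E", "L")):
    """Count occurrences of each entry of `search_words`, one character at a time."""
    counts = defaultdict(int)  # missing keys start at 0
    for line in lines:
        for word in line.strip():  # each "word" here is a single character
            if word in search_words:
                counts[word] += 1
    return counts

print(dict(count_search_letters(["ACX\n", "LLa\n"])))
```

One thing to watch with `defaultdict`: merely reading `counts["Q"]` creates the key with value 0, so membership should be checked with `in` rather than by indexing.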

Upvotes: 0

Kevin

Reputation: 76184

Add another if so you only increment counts for accepted letters.

for word in words:
    if word in ["A", "C", "D", "E", "L"]:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
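The question mentions that each sequence comes with a short description, and any accepted letters inside those descriptions would be counted too. If the file happens to be FASTA-style, with each description on its own line starting with `>`, those lines can be skipped first. That layout is an assumption, since the question does not show the file. `count_sequence_letters` is an illustrative name and Python 3 is assumed.

```python
def count_sequence_letters(lines, accepted="ACDEL"):
    """Count accepted letters, ignoring lines assumed to be descriptions."""
    counts = {}
    for line in lines:
        if line.startswith(">"):  # assumed description-line marker
            continue
        for letter in line.strip():
            if letter in accepted:
                if letter not in counts:
                    counts[letter] = 1
                else:
                    counts[letter] += 1
    return counts

sample = [">protein A CDE\n", "ACCA\n", "LEX\n"]
print(count_sequence_letters(sample))
```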

Upvotes: 1
