Reputation: 3768
To efficiently get the frequencies of the letters of a given alphabet (ABC) in a string code, as a dictionary, I can write a function like this (Python 3):
def freq(code):
    return {n: code.count(n) / float(len(code)) for n in 'ABC'}
Then
code='ABBBC'
freq(code)
Gives me
{'A': 0.2, 'C': 0.2, 'B': 0.6}
But how can I get the frequencies for each position along a list of strings of unequal lengths? For instance, mcode = ['AAB', 'AA', 'ABC', '']
should give me a nested structure such as a list of dicts, where each dict holds the frequencies at one position:
[{'A': 1.0, 'C': 0.0, 'B': 0.0},
{'A': 0.66, 'C': 0.0, 'B': 0.33},
{'A': 0.0, 'C': 0.5, 'B': 0.5}]
I cannot figure out how to compute the frequencies per position across all strings and wrap this in a list comprehension. Inspired by other SO posts on word counts, e.g. the well-discussed "Python: count frequency of words in a list", I thought Counter from the collections module might help.
Understand it like this - write the mcode strings on separate lines:
AAB
AA
ABC
Then what I need is the column-wise frequencies of the alphabet ABC (the columns being AAA, AAB, and BC), as a list of dicts where each list element holds the frequencies of A, B, and C in one column.
Upvotes: 1
Views: 1275
Reputation: 54303
Your code isn't efficient: it calls count once per letter, so the string is scanned once for each letter of the alphabet. You could just use Counter:
import itertools
from collections import Counter

mcode = ['AAB', 'AA', 'ABC', '']
all_letters = set(''.join(mcode))

def freq(code):
    # Drop the None padding that zip_longest adds for shorter strings.
    code = [letter for letter in code if letter is not None]
    n = len(code)
    counter = Counter(code)
    return {letter: counter[letter] / n for letter in all_letters}

print([freq(x) for x in itertools.zip_longest(*mcode)])
# [{'A': 1.0, 'C': 0.0, 'B': 0.0}, {'A': 0.6666666666666666, 'C': 0.0, 'B': 0.3333333333333333}, {'A': 0.0, 'C': 0.5, 'B': 0.5}]
For Python 2, you could use itertools.izip_longest instead, and divide by float(n) to avoid integer division.
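To see why this works, it may help to look at what zip_longest produces before any counting happens. The following is only a minimal sketch reusing mcode from above; the name columns is illustrative and not part of the answer's code.

```python
from itertools import zip_longest

mcode = ['AAB', 'AA', 'ABC', '']

# Transpose the strings: each tuple holds the letters found at one position,
# with None standing in for strings that are too short to reach it.
columns = list(zip_longest(*mcode))
print(columns)
# [('A', 'A', 'A', None), ('A', 'A', 'B', None), ('B', None, 'C', None)]
```

Each tuple is then passed to freq, which drops the None padding before counting.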
Upvotes: 0
Reputation: 1822
A much shorter solution:
from itertools import zip_longest

def freq(code):
    # Number of real letters in this column, ignoring the None padding.
    l = len(code) - code.count(None)
    return {n: code.count(n) / l for n in 'ABC'}

mcode = ['AAB', 'AA', 'ABC', '']
results = [freq(code) for code in zip_longest(*mcode)]
print(results)
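If the alphabet should not be hard-coded, one possible variation is sketched below. The alphabet parameter and the name column are assumptions added for illustration, not part of the answer above; the counting logic is otherwise the same.

```python
from itertools import zip_longest

def freq(column, alphabet='ABC'):
    # 'alphabet' is a hypothetical parameter added for this sketch.
    # Ignore the None padding that zip_longest adds for short strings.
    letters = [c for c in column if c is not None]
    return {n: letters.count(n) / len(letters) for n in alphabet}

mcode = ['AAB', 'AA', 'ABC', '']
results = [freq(column) for column in zip_longest(*mcode)]
print(results[0])  # {'A': 1.0, 'B': 0.0, 'C': 0.0}
```

A column always contains at least one real letter, because zip_longest stops at the length of the longest string, so the division is safe.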
Upvotes: 1
Reputation: 1708
Here is an example; the steps are briefly explained in the comments. Counter from the collections module is not used, because the mapping for each position must also contain characters that do not occur at that position, and the order of the frequencies does not seem to matter.
def freq(*words):
    # All dictionaries contain all characters as keys, even
    # if a character is not present at a position.
    # Create a sorted list of characters in chars.
    chars = set()
    for word in words:
        chars |= set(word)
    chars = sorted(chars)

    # Get the number of positions.
    max_position = max(len(word) for word in words)

    # Initialize the result list of dictionaries.
    result = [
        dict((char, 0) for char in chars)
        for position in range(max_position)
    ]

    # Count characters.
    for word in words:
        for position in range(len(word)):
            result[position][word[position]] += 1

    # Change counts to frequencies.
    for position in range(max_position):
        count = sum(result[position].values())
        for char in chars:
            result[position][char] /= count  # float(count) for Python 2

    return result

# Testing
from pprint import pprint
mcode = ['AAB', 'AA', 'ABC', '']
pprint(freq(*mcode))
Result (Python 3):
[{'A': 1.0, 'B': 0.0, 'C': 0.0},
{'A': 0.6666666666666666, 'B': 0.3333333333333333, 'C': 0.0},
{'A': 0.0, 'B': 0.5, 'C': 0.5}]
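In Python 3.6 the dictionaries even keep their keys in insertion order, which is sorted order here (this is an implementation detail of CPython 3.6 and guaranteed by the language from 3.7); earlier versions can use OrderedDict from collections instead of dict.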
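For versions where dict ordering is not guaranteed, the initialization step could be written roughly as follows. This is only a sketch: the helper name make_position_counts is hypothetical and does not appear in the code above.

```python
from collections import OrderedDict

def make_position_counts(chars, max_position):
    # Hypothetical helper: one OrderedDict per position, with the keys
    # inserted in sorted order so that iteration order is predictable.
    ordered_chars = sorted(chars)
    return [
        OrderedDict((char, 0) for char in ordered_chars)
        for _ in range(max_position)
    ]

counts = make_position_counts({'C', 'A', 'B'}, 3)
print(list(counts[0].keys()))  # ['A', 'B', 'C']
```

The counting and normalization loops from freq work unchanged on these dictionaries.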
Upvotes: 1