Reputation: 369
I wrote the following code, which prints how often each character appears in a file named 'Test'.
from os import strerror
from collections import Counter
try:
    with open('Test', 'rt') as handle:
        content = handle.read().lower().replace(' ', '').replace('\n', '')
        counts = Counter(content)
        for i in sorted(counts, key=lambda x: counts[x], reverse=True)[:30]:
            print('{} -> {}'.format(i, counts[i]))
except IOError as e:
    print('I/O error occurred: ', strerror(e.errno))
The output is:
e -> 383
o -> 247
s -> 226
t -> 224
n -> 219
a -> 217
r -> 201
i -> 188
d -> 127
h -> 125
l -> 112
c -> 112
m -> 105
u -> 72
f -> 59
p -> 59
g -> 58
y -> 48
b -> 47
. -> 36
w -> 35
, -> 35
v -> 28
k -> 25
0 -> 15
- -> 9
% -> 8
1 -> 7
’ -> 7
x -> 7
Afterward I realized I only need the alphabetic characters. I figure I have to modify this line:
content = handle.read().lower().replace(' ', '').replace('\n', '')
I am aware I could just write a for-loop that uses str.isalpha() as the condition to remove non-alphabetic characters.
I wonder if there are better ways to do that.
Thank you in advance for your feedback:-)
Upvotes: 0
Views: 201
Reputation: 912
You can replace this line:
content = handle.read().lower().replace(' ', '').replace('\n', '')
with this regex one-liner:
import re
content = re.sub("[^a-z]+", "", handle.read().lower())
This removes spaces, newlines, and all other non-alphabetic characters in a single pass.
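For anyone who wants to try the regex approach without a file on disk, here is a minimal sketch. The function name count_ascii_letters and the sample string are made up for illustration, and the pattern keeps only the ASCII letters a-z, so accented letters such as "é" would be dropped as well.

```python
import re
from collections import Counter


def count_ascii_letters(text):
    """Count only the ASCII letters a-z in text, ignoring case."""
    # Lowercase first, then delete every run of characters that is not a-z.
    letters = re.sub(r"[^a-z]+", "", text.lower())
    return Counter(letters)


# Hypothetical sample input standing in for the contents of the file.
sample = "Hello, World! 100% re-use"
counts = count_ascii_letters(sample)

for letter, n in counts.most_common(3):
    print('{} -> {}'.format(letter, n))
```

In the original script the same call would go inside the with block, passing handle.read() instead of the sample string.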
Upvotes: 1
Reputation: 73470
You can do it all in one go, using a generator expression or filter:
counts = Counter(filter(str.isalpha, handle.read().lower()))
Btw, you should also consider using Counter.most_common for your output:
for k, n in counts.most_common(30):
    print('{} -> {}'.format(k, n))
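Putting both suggestions together, a self-contained sketch could look like the following. The helper name top_letters and the sample text are assumptions for illustration only. Unlike an a-z regex, str.isalpha also accepts non-ASCII letters, so "é" would be counted here.

```python
from collections import Counter


def top_letters(text, n=30):
    """Return the n most frequent alphabetic characters in text, ignoring case."""
    # filter(str.isalpha, ...) yields only the characters for which
    # str.isalpha returns True, so digits, spaces and punctuation are skipped.
    counts = Counter(filter(str.isalpha, text.lower()))
    return counts.most_common(n)


# Hypothetical sample input standing in for handle.read().
sample = "Banana bread, 2 loaves!"
top = top_letters(sample, 3)

for letter, count in top:
    print('{} -> {}'.format(letter, count))
```

The generator-expression variant mentioned above would be Counter(c for c in text.lower() if c.isalpha()), which should give the same counts.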
Upvotes: 2