Reputation: 3
I am new to programming and I want to write a program that counts the frequency of words in a file. The expected output is as follows:
WORD FREQUENCY
in - 1
many - 1
other - 1
programming - 1
languages - 1
you - 1
would - 1
use - 1
a - 4
type - 1
called - 1
list’s - 1
TOTAL = x
I've almost got it working, but the word "list's" returns something like this:
list**â** - 1
s - 1
which throws off the total number of words counted in the file.
I've been using regex like this:
match_pattern = re.findall(r"\w+", infile)
Upvotes: 0
Views: 1110
Reputation:
This is a solution that does not use regex.
I am assuming there are multiple sentences in the file. Read the whole content as a single string and call str.split() with no arguments to split it on whitespace. You will get a list of the words in that string.
Next, pass that list to collections.Counter to get a dictionary-like object whose keys are the words and whose values are their frequencies.
from collections import Counter
with open('file.txt') as f:
    a = f.read()
# split() with no arguments splits on any run of whitespace
b = dict(Counter(a.split()))
b is a dictionary with the word-frequency pairs.
Note: periods stay attached to the last word of each sentence. You can ignore them in the results, or you can remove all periods first and then follow the procedure above. That also removes the '.' used in abbreviations, so it may not behave exactly as you want.
If you still want to use regex and match letters and apostrophe, try r"[a-zA-Z']+" and then use Counter. I will try to post some code for it when I get some time.
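A minimal sketch of that regex variant, under a few assumptions: the helper name count_words is hypothetical, the file is assumed to be UTF-8 encoded, and the character class also includes the typographic apostrophe ’ (U+2019), since the stray â in the question's output looks like that character being decoded with the wrong encoding.

```python
import re
from collections import Counter

# Letters plus the straight (') and typographic (’) apostrophes.
WORD_RE = re.compile(r"[a-zA-Z'’]+")

def count_words(text):
    """Return a Counter mapping each lowercased word to its frequency."""
    return Counter(WORD_RE.findall(text.lower()))

# Reading a real file (the file name and encoding are assumptions):
# with open('file.txt', encoding='utf-8') as f:
#     counts = count_words(f.read())

counts = count_words("a list’s type, a type called a list")
total = sum(counts.values())
```

Passing encoding='utf-8' to open() explicitly may be enough on its own to stop the apostrophe from turning into â, depending on how the file was saved.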
Upvotes: 0
Reputation: 27733
I'm guessing that a simple expression with a defaultdict might work:
import re
from collections import defaultdict
regex = r"(\b\w+\b)"
test_str = "some words before alice and bob Some WOrdS after Then repeat some words before Alice and BOB some words after then repeat"
matches = re.findall(regex, test_str)
print(matches)
words_dictionary = defaultdict(int)
for match in matches:
    words_dictionary[match] += 1
print(words_dictionary)
['some', 'words', 'before', 'alice', 'and', 'bob', 'Some', 'WOrdS', 'after', 'Then', 'repeat', 'some', 'words', 'before', 'Alice', 'and', 'BOB', 'some', 'words', 'after', 'then', 'repeat']
defaultdict(<class 'int'>, {'some': 3, 'words': 3, 'before': 2, 'alice': 1, 'and': 2, 'bob': 1, 'Some': 1, 'WOrdS': 1, 'after': 2, 'Then': 1, 'repeat': 2, 'Alice': 1, 'BOB': 1, 'then': 1})
To make the count case-insensitive, apply lower() to each match:
import re
from collections import defaultdict
regex = r"(\b\w+\b)"
test_str = "some words before alice and bob Some WOrdS after Then repeat some words before Alice and BOB some words after then repeat"
matches = re.findall(regex, test_str)
print(matches)
words_dictionary = defaultdict(int)
for match in matches:
    words_dictionary[match.lower()] += 1
print(words_dictionary)
Output with lower():
defaultdict(<class 'int'>, {'some': 4, 'words': 4, 'before': 2, 'alice': 2, 'and': 2, 'bob': 2, 'after': 2, 'then': 2, 'repeat': 2})
The expression is explained in the top-right panel of regex101.com, if you wish to explore, simplify, or modify it; there you can also watch how it matches against some sample inputs.
for key, value in words_dictionary.items():
    print(f'{key} - {value}')
some - 4
words - 4
before - 2
alice - 2
and - 2
bob - 2
after - 2
then - 2
repeat - 2
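The question also asks for a TOTAL line, which can be obtained by summing the counts. A small sketch, assuming a hypothetical helper named format_report and the same \w+ pattern with lower():

```python
import re
from collections import Counter

def format_report(text):
    """Build 'word - count' lines followed by a TOTAL line."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    lines = [f"{word} - {count}" for word, count in counts.items()]
    # The total is the sum of all individual word frequencies.
    lines.append(f"TOTAL = {sum(counts.values())}")
    return "\n".join(lines)

report = format_report("some words Some WOrdS")
print(report)
```

Counter keeps words in first-seen order, so the report lists them in the order they appear in the text.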
Upvotes: 1
Reputation:
Instead of using:
match_pattern = re.findall(r"\w+", infile)
Try using:
match_pattern = re.findall(r"\S+", infile)
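\w matches word characters: a-z, A-Z, 0-9 and _ (for str patterns in Python 3, also other Unicode letters and digits), so a match stops at the apostrophe.
\S matches any non-whitespace character, so the apostrophe stays inside the word.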
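A short comparison of the two patterns on a made-up sample string. Note that \S+ also keeps trailing punctuation attached to words, so this sketch strips it afterwards with string.punctuation; that extra step is an assumption about what the asker wants, not part of the suggestion above.

```python
import re
import string

sample = "a list’s type, called list."

# \w+ splits at the apostrophe; \S+ keeps everything between spaces.
word_tokens = re.findall(r"\w+", sample)
nonspace_tokens = re.findall(r"\S+", sample)

# Strip leading/trailing ASCII punctuation such as ',' and '.'.
cleaned_tokens = [t.strip(string.punctuation) for t in nonspace_tokens]

print(word_tokens)
print(nonspace_tokens)
print(cleaned_tokens)
```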
Upvotes: 0