Reputation: 61
I am trying to extract keywords from four different text files and attach a word frequency to each keyword. The text files look like this:
test1 = apple banana lemon
test2 = apple banana
test3 = lemon apple lemon
test4 = apple lemon grape
I think there is an issue in the bolded code (second block), but I am not sure how I should construct the initial dictionaries.
test1= [line.rstrip('\n') for line in open("test1.txt")]
test2= [line.rstrip('\n') for line in open("test2.txt")]
test3= [line.rstrip('\n') for line in open("test3.txt")]
test4= [line.rstrip('\n') for line in open("test4.txt")]
**
text_file = test1, test2, test3, test4
word_frequencies = 0
text_collection = {}
**
def dictionary(text):
    keywords = re.split(r'\W', text)
    print(text)
    word_frequencies = dict()
    for word in keyword:
        if word in word_frequences:
            word_frequences[word] += 1
        else:
            word_frequencies[word] = 1
    return word_frequencies
for all in text_file:
    file = open(all)
    text = file.read()
    print(file)
    text_collection[all] = dictionary(text)
print(text_collection)
Desired output:
{'test1.txt': {'apple': 1, 'banana': 1, 'lemon': 1},
'test2.txt': {'apple': 1, 'banana': 1},
'test3.txt': {'apple': 1, 'lemon': 2},
'test4.txt': {'apple': 1, 'lemon': 1, 'grape': 1}}
I would rather not use imported libraries in the answers. This code is more for practice than efficiency :)
Upvotes: 1
Views: 460
Reputation: 17166
This reuses the code from "Efficiently count word frequencies in python" with minor modifications:
from collections import Counter
from itertools import chain
import pprint

def file_word_counts(filename):
    " Word count of file "
    # Use collections.Counter to count words
    # Convert the Counter result to a regular dict (i.e. dict(Counter(...)))
    with open(filename) as f:
        return dict(Counter(chain.from_iterable(map(str.split, f))))

def file_counts(files):
    " Aggregate word count of multiple files into dictionary "
    return {filename: file_word_counts(filename) for filename in files}

# Show Results
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(file_counts(['test1.txt', 'test2.txt', 'test3.txt', 'test4.txt']))
Output
{ 'test1.txt': {'apple': 1, 'banana': 1, 'lemon': 1},
'test2.txt': {'apple': 1, 'banana': 1},
'test3.txt': {'apple': 1, 'lemon': 2},
'test4.txt': {'apple': 1, 'grape': 1, 'lemon': 1}}
Alternative
To produce the same without using additional modules
def file_word_counts(filename):
    " Word count of file "
    count_ = {}
    with open(filename) as f:
        for line in f:
            for i in line.rstrip().split():
                # Start each new word at 0, then increment
                count_.setdefault(i, 0)
                count_[i] += 1
    return count_

def file_counts(files):
    " Aggregate word count of multiple files into dictionary "
    return {filename: file_word_counts(filename) for filename in files}

print(file_counts(['test1.txt', 'test2.txt', 'test3.txt', 'test4.txt']))
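For comparison, the structure of the question's own code can also be kept with a few changes. The following is only a sketch under some assumptions: the loop iterates over file names rather than the lists of lines that were already read, words are separated by whitespace, and the helper names `count_words` and `count_files` are made up here for illustration. It splits with `str.split()` instead of `re.split(r'\W', text)`, because the latter returns an empty string after a trailing newline, which would then be counted as a word.

```python
def count_words(text):
    """Return a {word: count} dict for the whitespace-separated words in text."""
    word_frequencies = {}
    for word in text.split():
        # dict.get supplies 0 the first time a word is seen
        word_frequencies[word] = word_frequencies.get(word, 0) + 1
    return word_frequencies


def count_files(filenames):
    """Map each file name to the word counts of that file's contents."""
    text_collection = {}
    for name in filenames:
        with open(name) as f:
            text_collection[name] = count_words(f.read())
    return text_collection


# Quick check on a string, so no files are needed to try it out
print(count_words("lemon apple lemon"))  # {'lemon': 2, 'apple': 1}
```

Calling `count_files(['test1.txt', 'test2.txt', 'test3.txt', 'test4.txt'])` should then give the nested dictionary from the question, provided those files exist in the working directory.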
Upvotes: 2