How to transform text from text file to library keys with word frequency values?

Question

I am trying to extract the information from four different text files with several keywords. I want to extract these keywords and attach the word frequency to the keywords. The text files look like this:

test1 = apple banana lemon
test2 = apple banana
test3 = lemon apple lemon
test4 =  apple lemon grape

I think there is an issue in the bolded code (second paragraph); I am not sure how I should construct the initial dictionaries.

test1 = [line.rstrip('\n') for line in open("test1.txt")]
test2 = [line.rstrip('\n') for line in open("test2.txt")]
test3 = [line.rstrip('\n') for line in open("test3.txt")]
test4 = [line.rstrip('\n') for line in open("test4.txt")]

**
text_file = test1, test2, test3, test4
word_frequencies = 0
text_collection = {}
**

def dictionary(text):
    keywords = re.split(r'\W', text)
    print(text)
    word_frequencies = dict()
    for word in keyword:
        if word in word_frequences:
            word_frequences[word] += 1
        else:
            word_frequencies[word] = 1
    return word_frequencies

for all in text_file:
    file = open(all)
    text = file.read()
    print(file)
    text_collection[all] = dictionary(text)
print(text_collection)

Desired output:

{'test1.txt': {'apple': 1, 'banana': 1, 'lemon': 1},
'test2.txt': {'apple': 1, 'banana': 1},
'test3.txt': {'apple': 1, 'lemon': 2},
'test4.txt': {'apple': 1, 'lemon': 1, 'grape': 1}}

I would rather the answers not use imported libraries. This code is more for practice than efficiency :)

DarrylG · Accepted Answer

Reusing the code from "Efficiently count word frequencies in python", with minor modifications:

from collections import Counter
from itertools import chain
import pprint

def file_word_counts(filename):
    " Word count of file "
    # Use collections.Counter to count words:
    # map(str.split, f) splits each line into words, and
    # chain.from_iterable flattens those lists into one stream of words.
    # Convert the Counter result to a regular dict, i.e. dict(Counter(...))
    with open(filename) as f:
        return dict(Counter(chain.from_iterable(map(str.split, f))))

def file_counts(files):
    " Aggregate word count of multiple files into dictionary "
    return {filename: file_word_counts(filename) for filename in files}

# Show Results
pp = pprint.PrettyPrinter(indent=4)

pp.pprint(file_counts(['test1.txt', 'test2.txt', 'test3.txt', 'test4.txt']))

Output

{   'test1.txt': {'apple': 1, 'banana': 1, 'lemon': 1},    
    'test2.txt': {'apple': 1, 'banana': 1},    
    'test3.txt': {'apple': 1, 'lemon': 2},
    'test4.txt': {'apple': 1, 'grape': 1, 'lemon': 1}}
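
The one-liner in file_word_counts packs three steps into a single expression, so it can help to see them separately. The sketch below runs the same steps on an in-memory list; `lines` is a made-up stand-in for the open file object, not part of the original answer.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for iterating over an open file line by line
lines = ["lemon apple lemon\n", "apple grape\n"]

# Step 1: split every line on whitespace (this also drops the trailing newline)
split_lines = list(map(str.split, lines))

# Step 2: flatten the per-line lists into one flat list of words
words = list(chain.from_iterable(split_lines))

# Step 3: count occurrences and convert the Counter to a plain dict
counts = dict(Counter(words))

print(split_lines)  # [['lemon', 'apple', 'lemon'], ['apple', 'grape']]
print(counts)       # {'lemon': 2, 'apple': 2, 'grape': 1}
```

In the answer itself the intermediate lists are never built: map and chain.from_iterable are lazy, so Counter consumes the words one at a time.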

Alternative

To produce the same output without using additional modules:

def file_word_counts(filename):
    " Word count of file "
    count_ = {}
    with open(filename) as f:
        for line in f:
            for word in line.split():
                # Start each new word at 0, then increment its count
                count_.setdefault(word, 0)
                count_[word] += 1
    return count_

def file_counts(files):
    " Aggregate word count of multiple files into dictionary "
    return {filename: file_word_counts(filename) for filename in files}

print(file_counts(['test1.txt', 'test2.txt', 'test3.txt', 'test4.txt']))
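
For comparison, here is a minimal sketch of how the code in the question could be repaired while keeping its overall shape. It addresses the problems visible in the original: `text_file` held the file contents rather than the filenames, so `open(all)` could not work; `re` was used without being imported; `keyword` and `word_frequences` were misspellings of `keywords` and `word_frequencies`; and `re.split(r'\W', text)` produces empty strings at line ends, which would be counted as words. The names `count_words` and `count_files` are illustrative, not from the question.

```python
def count_words(text):
    """Map each whitespace-separated word in text to its frequency."""
    word_frequencies = {}
    # str.split() with no arguments never yields empty strings,
    # unlike re.split(r'\W', text)
    for word in text.split():
        word_frequencies[word] = word_frequencies.get(word, 0) + 1
    return word_frequencies

def count_files(filenames):
    """Build {filename: {word: count}} for every file in filenames."""
    text_collection = {}
    for filename in filenames:
        with open(filename) as f:
            text_collection[filename] = count_words(f.read())
    return text_collection

print(count_words("lemon apple lemon"))  # {'lemon': 2, 'apple': 1}
```

Assuming the four files exist in the working directory, `count_files(['test1.txt', 'test2.txt', 'test3.txt', 'test4.txt'])` should give the nested dictionary from the question. The initial dictionaries only need to be an empty dict per file and one empty outer dict; the `word_frequencies = 0` global in the question is not needed.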
