Reputation: 87
I have a script that does some basic text cleaning and tokenizing, then counts and sorts word frequencies. It works on individual files, but I need help applying it to an entire directory. In short, I'd like to use this code to count the global word frequency across the whole directory, not return separate counts for each file.
Here's my code:
import string
from collections import Counter

# Read the whole file; the with-block closes it afterwards
with open("german/test/polarity/positive/0.txt", mode="r", encoding="utf-8") as file:
    read_file = file.read()

# Remove punctuation
translation = str.maketrans("", "", string.punctuation)
stripped_file = read_file.translate(translation)

# Lowercase
file_clean = stripped_file.lower()

# Tokenize
file_tokens = file_clean.split()

# Word count and sort; Counter tallies the whole list in one call, so no loop is needed
def word_count(file_tokens):
    return Counter(file_tokens)

print(word_count(file_tokens))
Upvotes: 0
Views: 77
Reputation: 4367
For Python >= 3.6, use the os module to loop over the files in the directory:
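import os

# directory_in_str is the path to your directory, as a str
directory = os.fsencode(directory_in_str)
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith(".txt"):
        # Join with the str path; mixing bytes and str in os.path.join raises TypeError
        path = os.path.join(directory_in_str, filename)
        print(path)  # open and process the file here
    else:
        continue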
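To get one global count rather than per-file counts, the directory loop can be combined with the cleaning steps from the question by updating a single Counter for each file. This is a minimal sketch, not a definitive implementation: the helper names `tokenize` and `count_directory` are made up for illustration, and it assumes every `.txt` file sits directly in the directory (no subfolders) and is UTF-8 encoded.

```python
import os
import string
from collections import Counter

# Translation table that deletes ASCII punctuation (same idea as in the question)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def tokenize(text):
    """Strip punctuation, lowercase, and split on whitespace."""
    return text.translate(PUNCT_TABLE).lower().split()


def count_directory(directory_in_str):
    """Return one Counter covering every .txt file directly inside the directory."""
    total = Counter()
    for filename in os.listdir(directory_in_str):
        if not filename.endswith(".txt"):
            continue  # skip anything that is not a .txt file
        path = os.path.join(directory_in_str, filename)
        # Assumption: files are UTF-8, as in the question's open() call
        with open(path, mode="r", encoding="utf-8") as f:
            total.update(tokenize(f.read()))  # add this file's counts to the running total
    return total
```

Calling `count_directory("german/test/polarity/positive").most_common(10)` would then give the ten most frequent words across all files. `Counter.update` adds counts instead of replacing them, which is what makes the total accumulate across files.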
Upvotes: 0