Reputation: 59
I have been writing Python code to find the frequency distribution of the words contained in a text document, restricted to the words in a Python list (word_list). The program calculates the frequency distribution, and I can print it to the screen, but when I attempt to write the frequency distribution to a .csv file it only writes the last row of FreqDist, repeated once for every text file in the directory. My code is as follows:
CIK_List = []
for filename in glob.glob(os.path.join(test_path, '*.txt')):
    CIK = re.search(r"\_([0-9]+)\_", filename)  # extract the CIK from the filename
    path = nltk.data.find(filename)
    raw = open(path, 'r').read()
    tokens = word_tokenize(raw)
    words = [h.lower() for h in tokens]
    f_dist = nltk.FreqDist([s.lower() for s in words])
    print(f_dist)
    wordcount = collections.Counter()
    CIK_List.append(CIK)
    with open(file_path, 'w+', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["CIK"] + word_list)
        for m in word_list:
            print([CIK.group(1)], [f_dist[m]], end='')
        for val in CIK_List:
            writer.writerows(([val.group(1)] + [f_dist[m] for m in word_list],))
Upvotes: 0
Views: 181
Reputation: 64969
The problem is that for every input file you read, you create the output file anew and rewrite its contents.
Take a look at the following loop at the end of the code. What does it do?
for val in CIK_List:
    writer.writerows(([val.group(1)] + [f_dist[m] for m in word_list],))
CIK_List is a list of regexp matches. For each such regexp match, we write out the first matching group (which is the numeric part of the filename), and then we write out something that does not depend on val. So as val runs through the list of regexp matches, you get the same output time and time again.
You are also opening the file several times, once per input file, and every time you open the file, you throw away the contents that were there previously.
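A minimal, self-contained sketch of that truncation effect (the file name here is illustrative): reopening a file in 'w' mode discards whatever the previous iteration wrote, so only the rows from the final open survive:

```python
import csv
import os
import tempfile

# Reopening in 'w' mode truncates the file on every iteration,
# so only the row written after the last open() remains.
path = os.path.join(tempfile.mkdtemp(), "out.csv")
for row in (["a", 1], ["b", 2], ["c", 3]):
    with open(path, 'w', newline='') as f:   # 'w' truncates each time
        csv.writer(f).writerow(row)

with open(path, newline='') as f:
    rows = list(csv.reader(f))
print(rows)  # [['c', '3']] -- only the last row survives
```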
What you probably want to do is to open the output file once, write out the header row, and then, for each input file, write a single row to the output file based on the contents of that input file:
CIK_List = []
with open(file_path, 'w+', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["CIK"] + word_list)
    for filename in glob.glob(os.path.join(test_path, '*.txt')):
        CIK = re.search(r"\_([0-9]+)\_", filename)  # extract the CIK from the filename
        path = nltk.data.find(filename)
        raw = open(path, 'r').read()
        tokens = word_tokenize(raw)
        words = [h.lower() for h in tokens]
        f_dist = nltk.FreqDist([s.lower() for s in words])
        print(f_dist)
        wordcount = collections.Counter()
        CIK_List.append(CIK)
        for m in word_list:
            print([CIK.group(1)], [f_dist[m]], end='')
        writer.writerow([CIK.group(1)] + [f_dist[m] for m in word_list])
Upvotes: 1