Reputation: 59
I have been writing Python code to find the frequency distribution of the words contained in a text document, restricted to the words in a Python list (word_list). The program calculates the frequency distribution, and I can print it to the screen, but when I attempt to write the frequency distribution to a .csv file it only writes the last row of FreqDist, repeated once for every text file in the directory. My code is as follows:
CIK_List = []
for filename in glob.glob(os.path.join(test_path, '*.txt')):
    CIK = re.search(r"\_([0-9]+)\_", filename)  # extract the CIK from the filename
    path = nltk.data.find(filename)
    raw = open(path, 'r').read()
    tokens = word_tokenize(raw)
    words = [h.lower() for h in tokens]
    f_dist = nltk.FreqDist([s.lower() for s in words])
    print(f_dist)
    wordcount = collections.Counter()
    CIK_List.append(CIK)
    with open(file_path, 'w+', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["CIK"] + word_list)
        for m in word_list:
            print([CIK.group(1)], [f_dist[m]], end='')
        for val in CIK_List:
            writer.writerows(([val.group(1)] + [f_dist[m] for m in word_list],))
Upvotes: 0
Views: 181
Reputation: 64969
The problem is that for every input file you read, you create the output file anew and rewrite its contents.
Take a look at the following loop at the end of the code. What does it do?
for val in CIK_List:
    writer.writerows(([val.group(1)] + [f_dist[m] for m in word_list],))
CIK_List is a list of regexp matches. For each such regexp match, we write out the first matching group (which is the numeric part of the filename), and then we write out something that does not depend on val. So as val runs through the list of regexp matches, you get the same output time and time again.
You are also opening the file several times, once per input file, and every time you open the file, you throw away the contents that were there previously.
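A minimal, self-contained sketch of that truncation effect (the file name here is illustrative): reopening a file in 'w' mode discards whatever the previous iteration wrote, so only the rows from the final open survive:

```python
import csv
import os
import tempfile

# Reopening in 'w' mode truncates the file on every iteration,
# so only the row written after the last open() remains.
path = os.path.join(tempfile.mkdtemp(), "out.csv")
for row in (["a", 1], ["b", 2], ["c", 3]):
    with open(path, 'w', newline='') as f:   # 'w' truncates each time
        csv.writer(f).writerow(row)

with open(path, newline='') as f:
    rows = list(csv.reader(f))
print(rows)  # [['c', '3']] -- only the last row survives
```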
What you probably want to do is to open the output file once, write out the header row, and then, for each input file, write a single row to the output file based on the contents of that input file:
CIK_List = []
with open(file_path, 'w+', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["CIK"] + word_list)
    for filename in glob.glob(os.path.join(test_path, '*.txt')):
        CIK = re.search(r"\_([0-9]+)\_", filename)  # extract the CIK from the filename
        path = nltk.data.find(filename)
        raw = open(path, 'r').read()
        tokens = word_tokenize(raw)
        words = [h.lower() for h in tokens]
        f_dist = nltk.FreqDist([s.lower() for s in words])
        print(f_dist)
        wordcount = collections.Counter()
        CIK_List.append(CIK)
        for m in word_list:
            print([CIK.group(1)], [f_dist[m]], end='')
        writer.writerow([CIK.group(1)] + [f_dist[m] for m in word_list])
Upvotes: 1